
Photonic decision-making for arbitrary-number-armed bandit problem utilizing parallel chaos generation


Abstract

In this paper, we propose and experimentally demonstrate a novel scheme for solving an arbitrary-number-armed bandit problem by utilizing two simultaneously generated parallel chaotic signals and the epsilon (ɛ)-greedy strategy. In the proposed scheme, two chaotic signals are experimentally generated and then processed by an 8-bit analog-to-digital conversion (ADC) retaining the 4 least significant bits (LSBs), to produce two random sequences with uniform amplitude distributions for decision-making. The correspondence between these two random sequences and the different arms is established by a mapping rule designed on the basis of the ɛ-greedy strategy. On this basis, decision-making for an exemplary 5-armed bandit problem is successfully performed; moreover, the influences of the mapping rule and the unknown reward probabilities on the correct decision rate (CDR) performance for the 4-armed to 7-armed bandit problems are investigated. This work provides a novel way of solving the arbitrary-number-armed bandit problem.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Reinforcement learning is a field of machine learning that emphasizes how to take actions in response to changes in the environment so as to maximize the expected benefits [1–3]. The multi-armed bandit (MAB) problem is a classical problem in reinforcement learning [4,5]. It arises in many areas, such as computer gaming [1,2], advertising recommendation [6], and communication channel selection [7]. In the MAB problem [8–12], a player first selects one arm of a slot machine with K arms; the slot machine then yields a reward according to the reward probability of the selected arm (different arms have different reward probabilities). At the same time, the player learns from this cycle to improve his next-cycle selection. This procedure is repeated for a certain number of cycles, and the goal of the player is to maximize the total benefits. However, up to the present, most previously reported works have focused on solving the decision-making problem with algorithms [13,14], while solutions based on physical implementations, especially on photonic technologies, remain under-investigated, even though it has been confirmed that photonic technology has the potential to increase computing speed and reduce device cost [15–21].

In recent years, laser chaos generated by an external cavity semiconductor laser (ECSL) has been considered a promising candidate for photonic implementations that solve the MAB problem, owing to its wide bandwidth and excellent randomness [22–26]. For instance, T. Mihana and coworkers proposed a scheme to solve the 2-armed bandit problem by exploiting lag chaos synchronization in mutually coupled semiconductor lasers [10]. S. Y. Xiang and colleagues demonstrated a solution to the 2^2-armed bandit problem utilizing coupled semiconductor lasers with a phase-modulated Sagnac loop, as well as a creative scheme adopting a three-laser-network structure to solve the 2^3-armed bandit problem [11,12]. It has been confirmed that the 2^N-armed bandit problem can be effectively solved by jointly utilizing the tug-of-war algorithm, time-division multiplexing, and laser chaos [8–12]. Nevertheless, in practice there are many non-2^N-armed MAB problems, for which these 2^N-armed decision-making schemes are not efficient; this is an obstacle for photonic decision-making applications. Therefore, it is valuable to explore universal approaches that can solve an MAB problem with an arbitrary number of arms. Recently, T. Mihana et al. proposed a laser-network decision-making scheme for MAB problems, based on lag synchronization in a ring configuration, and demonstrated that arbitrary-number-armed MAB problems can be solved by increasing the number of lasers [27].

From a different perspective, in this work we propose and experimentally demonstrate a novel scheme that can solve an arbitrary-number-armed bandit problem, on the basis of the simultaneous generation of low-correlation chaos and the ɛ-greedy strategy. In Section 2, we experimentally demonstrate the generation of two random sequences with uniform amplitude distributions for decision-making, on the basis of the simultaneous generation of two low-correlation wideband chaotic signals. Section 3 describes the principles of the proposed decision-making scheme. In Section 4, decision-making for an exemplary 5-armed bandit problem is demonstrated, and the CDR performance is discussed. Finally, a brief conclusion is given in Section 5.

2. Extraction of random sequences for decision-making

In this section, the simultaneous generation of two low-correlation chaotic signals and the corresponding extraction of two random sequences (A(t) and B(t)) for decision-making are experimentally demonstrated.

2.1 Simultaneous generation of low-correlation chaotic signals

Figure 1(a) shows the experimental setup for the simultaneous generation of two low-correlation chaotic signals. The optical chaos generation system consists of a conventional ECSL, a self-feedback phase modulation loop (SFPML), and a filtering output module. An initial chaotic signal is generated by the ECSL and injected into the SFPML, where it is modulated by an electro-optic phase modulator (PM) and subsequently passed through a dispersion component (DC). After that, the signal is split into two parts: one part is fed back, photo-detected, and used as the driving signal of the PM, while the other part enters the filtering module. In the experiment, the bias current of the ECSL is 16.5 mA, which is 1.5 times the threshold current; the central wavelength is 1551.33 nm, and the feedback delay time of the ECSL is 100.2 ns. The DC is constructed with a dispersion compensation fiber with a dispersion value of 638 ps/nm. Owing to the spectrum expansion effect of the phase modulation and the phase-to-intensity conversion of the DC in the SFPML, the spectrum of the initial chaotic signal can be significantly expanded and the time-delay signature (TDS) can be efficiently suppressed [22,24]. The PM modulation depth is 2.2. A variable optical attenuator (VOA) is used to control the injection power of the photodetector (30 GHz bandwidth). The maximum power gain of the RF amplifier is 38 dB, with a bandwidth of 18 GHz. The feedback delay time of the SFPML is 24.3 ns. In the filtering output module, two optical tunable filters (OTFs) with central wavelengths of 1551.1 nm and 1551.7 nm are adopted to generate two chaotic signals, referred to as Output A and Output B. The 3-dB bandwidths of the OTFs are 0.2 nm. A 100 GS/s digital oscilloscope with four 25-GHz-bandwidth channels is used to record the relevant electrical signals. The optical spectra of the initial ECSL-generated chaos and the SFPML-output chaos, as well as those of Output A and Output B, are shown in Fig. 1(b). The optical spectrum of the ECSL chaotic signal (dark line) is narrow, whereas after passing through the SFPML it is broadened significantly (red line). After the parallel non-overlapping filtering (blue lines), two chaotic signals are simultaneously output. It is worth mentioning that the cross-correlation coefficient between Output A and Output B is as low as 0.14 (see Fig. 1(c)). Moreover, as shown in Fig. 1(d), with a sufficiently large central-wavelength spacing between OTF1 and OTF2 (larger than 0.2 nm), the cross-correlation coefficient between Output A and Output B can always be maintained at a relatively low level (smaller than 0.15). Here the cross-correlation is calculated by the method defined in [28,29].
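For reference, the short sketch below illustrates one common way to quantify this metric: a normalized cross-correlation function whose peak magnitude serves as the cross-correlation coefficient. It is a minimal numerical sketch under the standard normalized-correlation definition, with placeholder array names (output_a, output_b); it is not a reproduction of the exact procedure of [28,29].

```python
import numpy as np

def cross_correlation(x, y, max_lag=2000):
    """Normalized cross-correlation of two waveforms for lags -max_lag..max_lag."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.array([
        np.mean(x[max(0, -k):len(x) - max(0, k)] * y[max(0, k):len(y) - max(0, -k)])
        for k in lags
    ])
    return lags, cc

# The peak magnitude of cc is taken as the cross-correlation coefficient,
# e.g. lags, cc = cross_correlation(output_a, output_b); coeff = np.abs(cc).max()
```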

Fig. 1. (a) Experimental setup for the simultaneous generation of two low-correlation chaotic signals. (b) Optical spectra of the initial ECSL-generated chaotic signal (dark line), the chaotic signal passing through the SFPML (red line), and the two output chaotic signals (blue lines). (c) Cross-correlation of Output A and Output B. (d) Influence of the filtering wavelength on the cross-correlation between Output A and Output B. ECSL, external cavity semiconductor laser; SFPML, self-feedback phase modulation loop; PM, electro-optic phase modulator; DC, dispersive component; VOA, variable optical attenuator; PD, photodetector; RF, radio-frequency amplifier; OTF, optical tunable filter.

Figure 2 presents the temporal waveforms, the autocorrelation function (ACF), and the power spectra of the initial ECSL-generated chaotic signal and of Output A and Output B. Here the bandwidth is defined as the span between the direct-current component and the frequency below which 80% of the energy in the power spectrum is contained [30]. For the initial ECSL-generated chaotic signal (first column), the power is concentrated near the relaxation oscillation frequency, and consequently the bandwidth of the initial chaotic signal is only 5.82 GHz. Moreover, an obvious TDS appears at the ECSL feedback delay time (100.2 ns), which indicates a residual periodicity in the chaotic signal [31]. For Output A and Output B, compared with the initial chaotic signal, the spectra are much flatter and the bandwidths are enhanced to 13.52 GHz and 13.55 GHz, respectively. Moreover, noise-like temporal waveforms and ACF curves with no TDS are observed. That is, both the bandwidth and the randomness of the output chaotic signals are simultaneously enhanced.
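As an illustration of this bandwidth definition, the sketch below estimates the 80%-energy bandwidth from a recorded waveform. It is a minimal sketch under stated assumptions: the waveform array and the sampling rate fs are assumed inputs, and the function name is ours.

```python
import numpy as np

def effective_bandwidth(waveform, fs, energy_fraction=0.80):
    """Frequency span from DC containing `energy_fraction` of the total
    power-spectrum energy (the bandwidth definition of Ref. [30])."""
    power = np.abs(np.fft.rfft(waveform)) ** 2          # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / fs)  # 0 .. fs/2
    cumulative = np.cumsum(power) / power.sum()
    return freqs[np.searchsorted(cumulative, energy_fraction)]

# e.g. effective_bandwidth(output_a, fs=100e9) should be on the order of 13.5 GHz
```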

Fig. 2. Temporal waveforms (first row), ACF (second row) and power spectra (third row), for the ECSL-generated chaos and the Output A and Output B.

2.2 Extraction of random sequences for decision-making

An 8-bit analog-to-digital conversion (ADC) with 4 reserved least significant bits (LSBs) is adopted to extract two random sequences A(t) and B(t) (t = 1, 2, 3, …, 1000000) from Output A and Output B, following the traditional random number extraction method [32–34]. The sampling rate of the ADC is 10 GS/s. In Section 3, the decision-making scheme is introduced, based on the values and the probability-distribution characteristics of {A(t), B(t)}. As shown in Figs. 3(a) and 3(b), the probability distributions of A(t) and B(t) are uniform; that is, the randomness of A(t) and B(t) is excellent. Moreover, Fig. 3(c) shows the joint probability distribution of {A(t), B(t)}. Each probability for {A(t), B(t)} = {X, Y} (X, Y = 0, 1, 2, …, 15) is approximately 1/256. The cross-correlation function of A(t) and B(t) indicates that the cross-correlation coefficients are always lower than 0.03, which means that the cross-correlation between A(t) and B(t) is low.
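A minimal sketch of this extraction step is given below: the recorded waveform is resampled to the 10 GS/s rate, quantized to 8-bit codes, and only the 4 LSBs are kept, yielding values in 0–15. The full-scale mapping and the decimation factor (from the 100 GS/s oscilloscope record to 10 GS/s) are our assumptions about implementation details not spelled out in the text.

```python
import numpy as np

def extract_lsb_sequence(waveform, n_samples=1_000_000, decimation=10, bits=8, lsbs=4):
    """Model an 8-bit ADC and keep the 4 least significant bits of each sample."""
    samples = waveform[::decimation][:n_samples]          # resample to the 10 GS/s clock
    lo, hi = samples.min(), samples.max()
    codes = np.round((samples - lo) / (hi - lo) * (2 ** bits - 1)).astype(int)
    return codes & (2 ** lsbs - 1)                        # values 0..15, nearly uniform

# A = extract_lsb_sequence(output_a); B = extract_lsb_sequence(output_b)
```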

Fig. 3. Probability distribution histograms of (a) A(t) and (b) B(t), as well as (c) the joint probability distribution of A(t) and B(t).

3. Principles of the decision-making scheme

Before introducing the decision-making scheme for the multi-armed bandit problem [8,10,12], a few related concepts are briefly introduced. It is assumed that the slot machine has K arms (namely Arm1, Arm2, …, ArmK) with different unknown reward probabilities (P1, P2, …, PK), where K is a positive integer. The player makes M cycles of selection, and in each cycle he selects one of the K arms. Whether the player gets a reward in a cycle depends on the reward probability of the selected arm. For instance, assuming that Armi (i = 1, 2, …, K) is selected, the probability that the slot machine yields a reward is Pi, and the probability of yielding no reward is 1−Pi. The MAB problem can be interpreted as finding a decision-making strategy that maximizes the total benefits, or, equivalently, as finding the best arm (the arm with the highest reward probability) within a certain number of cycles.

Figure 4 shows the mapping rule we propose to solve the K-armed bandit problem. The mapping rule is designed on the basis of the epsilon-greedy strategy [5]. The epsilon-greedy strategy is an easily realized and extensively adopted decision-making strategy, in which the player selects the current optimal arm with a probability of 1−ɛ, or randomly selects one of the K arms, each with a probability of ɛ/K. In the proposed scheme, a set {A(t), B(t)} consisting of 256 elements is first constructed, where the values of A(t) and B(t) range from 0 to 15. This set is then divided into K+1 subsets, where each of the first K subsets contains a elements, while the last ((K+1)-th) subset contains 256−Ka elements. The values of the elements in each subset satisfy the following relationship:

$$\left\{ \begin{array}{l} \textrm{Subset } 1:\ 1 \le A(t)\times 16 + B(t) + 1 \le a,\\ \textrm{Subset } 2:\ a + 1 \le A(t)\times 16 + B(t) + 1 \le 2a,\\ \qquad\qquad\qquad \ldots\ldots\\ \textrm{Subset } K:\ (K-1)a + 1 \le A(t)\times 16 + B(t) + 1 \le Ka,\\ \textrm{Subset } K+1:\ Ka + 1 \le A(t)\times 16 + B(t) + 1 \le 256. \end{array} \right.$$

If the value of A(t)×16+B(t)+1 extracted from the chaotic signals falls in the range of subset i (one of the first K subsets), Armi is selected. Otherwise, it falls in the range of the (K+1)-th subset, and the current optimal arm, i.e., the arm with the largest average yield-reward over the previous selections, is selected.
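The mapping of Eq. (1) can be summarized by the following minimal sketch, assuming arms are indexed 1 to K and `current_best` denotes the arm with the largest average yield-reward so far; the function and variable names are ours.

```python
def select_arm(A_t, B_t, a, K, current_best):
    """Map the random pair {A(t), B(t)} (each in 0..15) onto an arm index
    according to the subset partition of Eq. (1)."""
    idx = A_t * 16 + B_t + 1        # index in 1..256
    if idx <= K * a:
        return (idx - 1) // a + 1   # subsets 1..K: explore the corresponding arm
    return current_best             # subset K+1: exploit the current optimal arm
```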

Fig. 4. Mapping rule of the proposed decision-making scheme.

Figure 5 presents the overall flow of the decision-making process, which contains M cycles. The player starts from the 1st cycle and selects an arm according to the mapping rule in each cycle; the arm identified as optimal after M cycles is taken as the final decision. The workflow of each cycle is introduced by taking the t-th cycle (t = 1, 2, …, M) as an example. First, the random values A(t) and B(t) for the t-th cycle are extracted from the two low-correlation chaotic signals. Then, the corresponding arm of the multi-armed slot machine is selected, according to the mapping rule. Finally, the average yield-reward of each arm is updated and the value of a is adjusted, according to specific information such as whether the slot machine yields a reward, which arm has been selected, and the number of cycles t that have been executed. This procedure is equivalent to updating the mapping rule. The average yield-reward of each arm is updated as follows:

$${V_i}(t) = \begin{cases} \dfrac{{V_i}(t-1){N_i}(t-1) + {R_i}(t)}{{N_i}(t-1) + 1}, & \textrm{if Arm}i\textrm{ is selected}\\[2ex] {V_i}(t-1), & \textrm{otherwise} \end{cases}$$
$${N_i}(t) = \begin{cases} {N_i}(t-1) + 1, & \textrm{if Arm}i\textrm{ is selected}\\ {N_i}(t-1), & \textrm{otherwise} \end{cases}$$

Fig. 5. Schematic of the proposed decision-making process.

Here, Vi(t) is the average yield-reward of Armi (i = 1, 2, 3, …, K) in the t-th cycle, Ni(t) is the total number of times that Armi has been selected in the first t cycles, and Ri(t) represents the reward yielded by the slot machine, whose value is 0 or 1. Specifically, if the slot machine yields a reward, then Ri(t) = 1; otherwise Ri(t) = 0. In the experiment, the initial number of selections of each arm is Ni(0) = 1. To guarantee that only one arm is identified as the current optimal arm in each decision-making cycle, the initial values of the average reward Vi(0) are randomly set as floating-point numbers that are not equal to each other and lie in a narrow range close to 1; here they are set in the range 0.95 < Vi(0) < 1.
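A minimal sketch of the per-cycle update of Eqs. (2) and (3), together with the initialization described above, is given below; the dictionary-based bookkeeping and function name are ours.

```python
import random

def update_estimates(V, N, selected_arm, reward):
    """Incremental update of the average yield-reward V_i and the selection
    count N_i for the selected arm (Eqs. (2) and (3)); other arms are unchanged."""
    i = selected_arm
    V[i] = (V[i] * N[i] + reward) / (N[i] + 1)
    N[i] += 1

# Initialization used in the experiment: N_i(0) = 1 and distinct V_i(0) in (0.95, 1)
K = 5
N = {i: 1 for i in range(1, K + 1)}
V = {i: random.uniform(0.95, 1.0) for i in range(1, K + 1)}
```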

4. Experimental results and discussion

In this section, several exemplary decision-making experiments on different MAB problems are carried out to confirm the feasibility and investigate the performance of the proposed scheme, based on the extracted random sequences A(t) and B(t) and the designed mapping rule. The correct decision rate (CDR) [6–10] is adopted to evaluate the performance of the proposed decision-making scheme. A CDR value above 0.9 is regarded as a correct decision, and the fewer cycles required for the CDR to converge above 0.9, the higher the decision-making efficiency. The CDR of the t-th cycle is calculated by:

$$\textrm{CDR}(t) = \frac{\sum\limits_{j = 1}^{L} \Delta(j, t)}{L}$$

In this equation, L represents the number of repeated trials for each cycle. For the j-th trial in the t-th cycle, if the best arm is selected, then Δ(j, t) = 1; if one of the other arms is selected, then Δ(j, t) = 0. It is worth noting that Eq. (4) also reflects the average probability that the player selects the best arm in the t-th cycle. In this work, L is set to 1000, and the number of cycles for each decision-making experiment is set to 200 (M = 200).
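Equation (4) amounts to averaging an indicator function over the L repeated trials; a minimal sketch, with an array layout assumed by us, is:

```python
import numpy as np

def correct_decision_rate(selections, best_arm):
    """CDR(t) of Eq. (4). `selections` is an (L, M) integer array whose entry
    (j, t) is the arm selected in the j-th trial of the t-th cycle."""
    return np.mean(selections == best_arm, axis=0)  # length-M vector of CDR values
```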

4.1 Decision-making for an exemplary 5-armed bandit problem

To confirm the feasibility and evaluate the performance of the proposed decision-making scheme, an exemplary non-2^N-armed bandit problem, namely the 5-armed bandit problem (K = 5), is taken as an example. The reward probabilities of the arms are set to 0.5, 0.4, 0.8, 0.3, and 0.6, respectively.

First, the influence of the parameter a on the CDR is analyzed. As shown in Fig. 6(a), the smaller the value of a, the higher the value to which the CDR converges. This is because, as the value of a decreases, the probability of selecting the current optimal arm in each cycle increases. After a certain number of selections, the current optimal arm becomes the best arm of the final decision. It is also found that the theoretical maximum value of the CDR (CDRhigh) can be evaluated by:

$$\textrm{CDR}_{\textrm{high}} \approx \frac{a + (256 - Ka)}{256}$$

Equation (5) is derived as follows. If, in the (t−1)-th cycle of all repeated trials, the current optimal arm has already converged to the best arm (denoted Armi), then in the mapping rule of the t-th cycle there are two subsets that correspond to the best arm, namely subset i and subset K+1. The probabilities that the value of A(t)×16+B(t)+1 falls in these two subsets are about a/256 and (256−Ka)/256, respectively. Consequently, the probability that the best arm is selected in the t-th cycle is about a/256+(256−Ka)/256. Equation (5) indicates that a should be less than 6.4 in order to guarantee that the CDR can reach 0.9 in the demonstrated 5-armed bandit problem. Nevertheless, it is worth noting that the smaller the value of a, the larger the number of selection cycles required for the CDR to converge to 0.9. In Fig. 6(b), when a is set to 1 (dark curve), the CDR only reaches 0.87 at t = 200 and does not reach 0.9 until t = 381. In addition, by dynamically decreasing the value of a, the CDR can converge to 0.9 much more quickly. As shown by the brown curve in Fig. 6(b), where in the first 50 selection cycles the value of a is monotonically decreased to 2 and then fixed at 1 from the 51st selection cycle, the CDR converges to 0.9 at the t = 90 cycle. This is much faster than the case with a fixed a, although for t < 40 the CDR values are smaller than those of the fixed-a case. This is because, when the initial value of a is relatively large, the arm selection is relatively random, so the CDR values of the dynamically varying a scenario are smaller than those of the fixed-a scenario. However, as the value of a gradually decreases, the selection probability of the current optimal arm increases correspondingly, and consequently the correct decision can be made quickly. In the following discussions, a is varied in this way.
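For concreteness, the dynamically varying a used for the brown curve in Fig. 6(b) and the CDR ceiling of Eq. (5) can be written as the following sketch; the function names are ours.

```python
def a_schedule(t):
    """a = 51, 50, ..., 2 during cycles t = 1..50, then a = 1 afterwards
    (the schedule of the brown curve in Fig. 6(b))."""
    return 52 - t if t <= 50 else 1

def cdr_upper_bound(a, K):
    """Theoretical CDR ceiling of Eq. (5)."""
    return (a + (256 - K * a)) / 256

# For K = 5: cdr_upper_bound(10, 5) = 0.84375, so a must stay below 6.4 for CDR >= 0.9
```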

Fig. 6. (a) Evolution of CDR as a function of cycle number for the exemplary 5-armed bandit problem, by setting a = 10 (red curve), a = 20 (blue curve), a = 30 (dark curve), a = 40 (green curve), a = 50 (purple curve); (b) evolution of CDR for the cases with fixed a = 1 (dark curve) and dynamically varying a (a = 51, 50, 49, …, 3, 2 at t = 1, 2, 3, …, 49, 50, then a = 1 at t = 51, 52, 53, …, 199, 200) (brown curve).

To show the evolution of the decision-making process more intuitively, Fig. 7 presents the average yield-reward curve of each arm in one trial of Fig. 6(b). When t < 20, the average yield-reward of each arm changes frequently. This is because in this stage the value of a is large and the probability of each arm being selected is similar. As the number of selection cycles increases and the value of a decreases, almost only the curves of Arm3 (red curve) and Arm5 (blue curve) keep changing frequently, which means that the player mainly selects these two high-reward-probability arms to maximize the total benefits. When t > 120, the average yield-reward of Arm3 remains higher than those of the other arms, and thus Arm3 is always identified as the current optimal arm.

Fig. 7. Average yield-reward of Arm1 (green curve), Arm2 (yellow curve), Arm3 (red curve), Arm4 (dark curve), Arm5 (green curve), versus the number of selection cycles (t) in one trial. The variation of a is the same as that of the brown curve in Fig. 6(b).

Figure 8 shows the CDR performance of the proposed decision-making scheme in the exemplary 5-armed bandit problem with different reward probabilities. When only the position of the best arm is changed (from Arm1 to Arm5), similar CDR trends are obtained. This illustrates that operations in the decision-making scheme such as naming and numbering the arms and mapping the arm indices to the fixed subsets one by one do not cause significant differences in the CDR evolution trend. Figure 8(b) indicates that the decision-making efficiency is closely related to the reward-probability differences among the arms. When the reward probability of the best arm is significantly higher than those of the other arms (blue curve in Fig. 8(b)), the CDR reaches 0.9 more quickly than in the other cases.

Fig. 8. Evolution of CDR as a function of the selection cycle number t, for the cases with (a) similar reward probability (the best arm is Arm1 (red curve), Arm2 (blue curve), Arm3 (brown curve), Arm4 (green curve), Arm5 (dark curve)) and (b) different reward probability (Arm 1 is always the best arm).

In general, the proposed scheme can achieve fast and correct decision-making in solving the multi-armed bandit problem.

4.2 Decision-making for other number-armed bandit problems

Compared with the existing photonic decision-making schemes, the proposed photonic decision-making scheme is able to solve non-2^N-armed bandit problems. When switching from the 5-armed bandit problem to another arbitrary-number-armed bandit problem, by simply adjusting the parameters K and a in the mapping rule, the proposed decision-making scheme still shows excellent decision-making performance. As shown in Fig. 9, the evolution curves of the CDR for the 4-armed, 5-armed, 6-armed and 7-armed MAB problems are demonstrated. These MAB problems are all solved successfully, and the CDR reaches 0.9 at the 58th, 89th, 104th, and 133rd cycles, respectively.

Fig. 9. Evolution of CDR as a function of the selection cycle number t, for the cases of (a) 4-armed bandit problem with Pi = {0.8, 0.4, 0.5, 0.3}, (b) 5-armed bandit problem with Pi = {0.8, 0.4, 0.5, 0.3, 0.6}, (c) 6-armed bandit problem with Pi = {0.8, 0.4, 0.5, 0.3, 0.6, 0.4}, and (d) 7-armed bandit problem with Pi = {0.8, 0.4, 0.5, 0.3, 0.6, 0.4, 0.5}.

Furthermore, Fig. 10 presents the influence of the number of arms (K) on the number of cycles at which the CDR first reaches 0.9 (NCDR=0.9) [27]. Here the number of arms ranges from 4 to 9, the reward probabilities of all ordinary arms (non-optimal arms) are set to 0.4, and the maximum number of cycles for each decision-making experiment is set to 1000 (M = 1000). The larger the number of arms, the larger NCDR=0.9. Moreover, the overall variation of NCDR=0.9 is approximately linear in K, and the slope of the linear variation is closely related to the reward probabilities. When the reward probability of the best arm is significantly higher than those of the other arms (red curve), it is easier to make a correct decision, so the slope of NCDR=0.9 is smaller than those of the other cases. This linear relationship between NCDR=0.9 and K is beneficial for estimating the minimum number of cycles needed to find the best arm.

Fig. 10. Number of cycles at which CDR first reaches 0.9 versus the number of arms, with Pi = {0.6, 0.4, …, 0.4} (blue curve), Pi = {0.7, 0.4, …, 0.4} (dark curve), and Pi = {0.8, 0.4, …, 0.4} (red curve). The lines are the results of linear fitting.

In addition, other K-armed bandit problems (especially those with large K values, such as K > 256) can be solved by expanding the space of the random sequences in the mapping rule. Here the space of the random sequences refers to the number of different values of the random pair {A(t), B(t)}, which is 256 in the abovementioned discussions. Potential approaches include reserving more least significant bits when extracting A(t) and B(t) (e.g., reserving 5 or more LSBs), or expanding the random sequence space to more than two dimensions by extracting multiple random number sequences on the basis of multi-channel low-correlation chaos generation.
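As a rough, illustrative estimate of scale (our own arithmetic, not an experimental result): reserving 5 LSBs from each of the two sequences would expand the space from 16 × 16 = 256 to 32 × 32 = 1024 values, while three parallel 4-LSB sequences would give 16 × 16 × 16 = 4096 values; either expansion leaves at least one exploration element per arm even when K exceeds 256.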

5. Conclusion

In conclusion, a novel decision-making scheme for solving an arbitrary-number-armed bandit problem is proposed and experimentally demonstrated, on the basis of the parallel generation of two low-correlation chaotic signals and the epsilon-greedy strategy. Two low-correlation, wideband, TDS-suppressed chaotic signals originating from an ECSL are simultaneously generated by self-feedback phase modulation and parallel filtering. On this basis, two random sequences with uniform distributions are extracted from the simultaneously generated chaotic signals utilizing an 8-bit ADC retaining 4 LSBs. With these random sequences, the mapping rules for arm selection are designed. We successfully perform decision-making in the exemplary cases of the 4-, 5-, 6-, and 7-armed bandit problems with CDR > 0.9. The experimental results show that fast decision-making can be realized with the proposed scheme. This work presents an efficient scheme for solving arbitrary-number-armed bandit problems, which may pave the way for the development of photonic decision-making.

Funding

Sichuan Province Science and Technology Support Program (2021JDJQ0023); National Natural Science Foundation of China (61671119); Science and Technology Commission of Shanghai Municipality (SKLSFO2020-05); Fundamental Research Funds for the Central Universities (ZYGX2019J003).

Disclosures

The authors declare no conflicts of interest.

Data availability

The data used to support the findings of this study are available from the corresponding author upon request.

References

1. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature 529(7587), 484–489 (2016). [CrossRef]

2. D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Van Den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of Go without human knowledge,” Nature 550(7676), 354–359 (2017). [CrossRef]

3. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, 1998).

4. H. Robbins, “Some aspects of the sequential design of experiments,” Bull. Amer. Math. Soc. 58(5), 527–536 (1952). [CrossRef]  

5. A. Potapov and M. Ali, “Learning, exploration and chaotic policies,” Int. J. Mod. Phys. C 11(07), 1455–1464 (2000). [CrossRef]  

6. M. Naruse, T. Matsubara, C. Nicolas, K. Kanno, T. Yang, and A. Uchida, “Generative adversarial network based on chaotic time series,” Sci. Rep. 9(1), 12963 (2019). [CrossRef]  

7. S. Takeuchi, M. Hasegawa, K. Kanno, A. Uchida, N. Chauvet, and M. Naruse, “Dynamic channel selection in wireless communications via a multi-armed bandit algorithm using laser chaos time series,” Sci. Rep. 10(1), 1574 (2020). [CrossRef]  

8. M. Naruse, N. Chauvet, A. Uchida, A. Drezet, G. Bachelier, S. Huant, and H. Hori, “Decision making photonics: solving bandit problems using photons,” IEEE J. Select. Topics Quantum Electron. 26(1), 1–10 (2020). [CrossRef]

9. M. Naruse, Y. Terashima, A. Uchida, and S. Kim, “Ultrafast photonic reinforcement learning based on laser chaos,” Sci. Rep. 7(1), 8772 (2017). [CrossRef]  

10. T. Mihana, Y. Mitsui, M. Takabayashi, K. Kanno, S. Sunada, M. Naruse, and A. Uchida, “Decision making for the multi-armed bandit problem using lag synchronization of chaos in mutually coupled semiconductor lasers,” Opt. Express 27(19), 26989–27008 (2019). [CrossRef]  

11. Y. Ma, S. Y. Xiang, X. X. Guo, Z. Song, A. J. Wen, and Y. Hao, “Time-delay signature concealment of chaos and ultrafast decision making in mutually coupled semiconductor lasers with a phase-modulated Sagnac loop,” Opt. Express 28(2), 1665–1678 (2020). [CrossRef]  

12. Y. Han, S. Y. Xiang, Y. Wang, Y. Ma, B. Wang, A. J. Wen, and Y. Hao, “Generation of multi-channel chaotic signals with time signature concealment and ultrafast photonic decision making based on globally-coupled semiconductor lasers network,” Photonics Res. 8(11), 1792–1799 (2020). [CrossRef]  

13. K. Morihiro, N. Matsui, and H. Nishimura, “Chaotic exploration effects on reinforcement learning in shortcut maze task,” Int. J. Bifurcation Chaos Appl. Sci. Eng. 16(10), 3015–3022 (2006). [CrossRef]  

14. N. Daw, J. O’Doherty, P. Dayan, B. Seymour, and R. Dolan, “Cortical substrates for exploratory decisions in humans,” Nature 441(7095), 876–879 (2006). [CrossRef]  

15. J. Bueno, S. Maktoobi, L. Froehly, I. Fischer, M. Jacquot, L. Larger, and D. Brunner, “Reinforcement learning in a large-scale photonic recurrent neural network,” Optica 5(6), 756–760 (2018). [CrossRef]  

16. N. Chauvet, D. Jegouso, B. Boulanger, H. Saigo, K. Okamura, H. Hori, A. Drezet, S. Huant, G. Bachelier, and M. Naruse, “Entangled-photon decision maker,” Sci. Rep. 9(1), 12229 (2019). [CrossRef]

17. S. D. Smith, “Optical bistability, photonic logic, and optical computation,” Appl. Opt. 25(10), 1550–1564 (1986). [CrossRef]  

18. S. Y. Xiang, Z. Ren, Z. Song, Y. Zhang, X. X. Guo, G. Q. Han, and Y. Hao, “Computing primitive of fully-VCSELs-based all-optical spiking neural network for supervised learning and pattern classification,” IEEE Trans. Neural Netw. Learning Syst. 32(6), 2494–2505 (2021). [CrossRef]  

19. J. Feldmann, N. Youngblood, C. D. Wright, H. Bhaskaran, and W. H. P. Pernice, “All-optical spiking neurosynaptic networks with self-learning capabilities,” Nature 569(7755), 208–214 (2019). [CrossRef]  

20. S. Y. Xiang, Y. Zhang, J. Gong, X. X. Guo, L. Lin, and Y. Hao, “STDP-based unsupervised spike pattern learning in a photonic spiking neural network with VCSELs and VCSOAs,” IEEE J. Select. Topics Quantum Electron. 25(6), 1–9 (2019). [CrossRef]  

21. Q. Cai, Y. Guo, P. Li, A. Bogris, K. A. Shore, Y. Zhang, and Y. C. Wang, “Modulation format identification in fiber communications using single dynamical node-based photonic reservoir computing,” Photonics Res. 9(1), B1–B8 (2021). [CrossRef]  

22. N. Jiang, A. K. Zhao, S. Q. Liu, C. P. Xue, B. Y. Wang, and K. Qiu, “Generation of broadband chaos with perfect time delay signature suppression by using self-phase-modulated feedback and a microsphere resonator,” Opt. Lett. 43(21), 5359–5362 (2018). [CrossRef]  

23. A. K. Zhao, N. Jiang, S. Q. Liu, C. P. Xue, and K. Qiu, “Wideband time delay signature-suppressed chaos generation using self-phase-modulated feedback semiconductor laser cascaded with dispersive component,” J. Lightwave Technol. 37(19), 5132–5139 (2019). [CrossRef]  

24. A. K. Zhao, N. Jiang, C. C. Chang, Y. J. Wang, S. Q. Liu, and K. Qiu, “Generation and synchronization of wideband chaos in semiconductor lasers subject to constant-amplitude self-phase-modulated optical injection,” Opt. Express 28(9), 13292–13298 (2020). [CrossRef]  

25. N. Jiang, C. Wang, C. P. Xue, G. Li, S. Q. Liu, and K. Qiu, “Generation of flat wideband chaos with suppressed time delay signature by using optical time lens,” Opt. Express 25(13), 14359–14367 (2017). [CrossRef]  

26. Y. Hong, X. Chen, P. S. Spencer, and K. A. Shore, “Enhanced flat broadband optical chaos using low-cost VCSEL and fiber ring resonator,” IEEE J. Quantum Electron. 51(3), 1–6 (2015). [CrossRef]  

27. T. Mihana, K. Fujii, Y. Mitsui, K. Kanno, M. Naruse, and A. Uchida, “Laser network decision making by lag synchronization of chaos in a ring configuration,” Opt. Express 28(26), 40112–40130 (2020). [CrossRef]  

28. X. Porte, O. D’huys, T. Jüngling, X. Porte, D. Brunner, M. C. Soriano, and I. Fischer, “Autocorrelation properties of chaotic delay dynamical systems: a study on semiconductor lasers,” Phys. Rev. E 90(5), 052911 (2014). [CrossRef]  

29. D. Rontani, A. Locquet, M. Sciamanna, D. S. Citrin, and S. Ortin, “Time-delay identification in a chaotic semiconductor laser with optical feedback: a dynamical point of view,” IEEE J. Quantum Electron. 45(7), 879–891 (2009). [CrossRef]

30. Y. Hong and S. K. Ji, “Effect of digital acquisition on the complexity of chaos,” Opt. Lett. 42(13), 2507–2510 (2017). [CrossRef]  

31. M. Cheng, X. Gao, L. Deng, L. Liu, Y. Deng, S. Fu, M. Zhang, and D. Liu, “Time-delay concealment in a three-dimensional electro-optic chaos system,” IEEE Photonics Technol. Lett. 27(9), 1030–1033 (2015). [CrossRef]  

32. I. Reidler, Y. Aviad, and M. Rosenbluh, “Ultrahigh-speed random number generation based on a chaotic semiconductor laser,” Phys. Rev. Lett. 103(2), 024102 (2009). [CrossRef]  

33. K. Hirano, T. Yamazaki, and S. Morikatsu, “Fast random bit generation with bandwidth enhanced chaos in semiconductor lasers,” Opt. Express 18(6), 5512–5524 (2010). [CrossRef]  

34. N. Jiang, Y. J. Wang, A. K. Zhao, S. Q. Liu, L. Chen, B. Li, and K. Qiu, “Simultaneous bandwidth-enhanced and time delay signature-suppressed chaos generation in semiconductor laser subject to feedback from parallel coupling ring resonators,” Opt. Express 28(2), 1999–2009 (2020). [CrossRef]  
