
Working memory load recognition with deep learning time series classification


Abstract

Working memory load (WML) is one of the most widely applied signals in human–machine interaction, and its precise evaluation is crucial for such applications. This study aims to propose a deep learning (DL) time series classification (TSC) model for inter-subject WML decoding. We used functional near-infrared spectroscopy (fNIRS) to record the hemodynamic signals of 27 participants during visual working memory tasks. Traditional machine learning and deep time series classification algorithms were used for intra-subject and inter-subject WML decoding, respectively, from the collected blood oxygen signals. The intra-subject classification accuracies of LDA and SVM were 94.6% and 79.1%, respectively. Our proposed TAResnet-BiLSTM model had the highest inter-subject WML decoding accuracy, reaching 92.4%. This study provides a new idea and method for the brain-computer interface application of fNIRS in real-time WML detection.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Brain-computer interfaces (BCIs) enable real-time information exchange between the human brain and machines [1]. A BCI can algorithmically translate brain signals into outputs, triggering an operation with an instruction or enabling the user to control body movement through external devices, bypassing the locomotor system. So far, the BCI technique has been applied widely in monitoring users' cognitive function [2] and emotional state [3,4] and in guiding the rehabilitation of patients with impaired or even lost motor function [5–7]. Cognitive monitoring infers the user's cognitive state through the analysis of brain signals. In safety-critical areas, the user's cognitive load and working memory load (WML) have a significant impact on such activities.

Working memory (WM) is a short-term memory system that maintains and manipulates information momentarily. During WM, information is rapidly extracted and renewed. Sufficient WM lays the foundation for the success of cognitive tasks [8]. In passive BCI, WML constitutes an essential signal source for monitoring applications to ensure a reliable decision-making process. The prefrontal cortex (PFC) is the central executive system of the brain and plays an important role in WM [9–11]. The N-back and Sternberg tasks are two classic research paradigms of WM. Compared with the N-back task, the Sternberg task is less influenced by practice, allows the separation of encoding, maintenance, and retrieval phases [12], and is closer to actual WM performance. Thus, in the present study, we applied the Sternberg task to build a method suitable for real-time monitoring of WML in BCI.

For now, most studies on WML classification with fNIRS apply machine learning (ML) based on statistical features of the signals [13], with the support vector machine (SVM) and linear discriminant analysis (LDA) being the two most commonly applied algorithms [14]. Dong et al. used the kurtosis, skewness, and peak of hemodynamic signals to classify three WMLs with SVM and achieved a mean accuracy of 74% [15]. De et al. extracted the mean, skewness, and kurtosis of the oxyhemoglobin and deoxyhemoglobin signals of the PFC, together with the mean blood oxygenation and total blood volume. After feature selection, LDA classified the three types of WMLs of healthy subjects with an accuracy of 88.9% [16].

Although ML can recognize WMLs with encouraging results, it depends on the extraction and selection of features, which places high demands on researchers' understanding of the data. Due to individual differences, WML decoding needs to vary from person to person, that is, personalized decoders. A personalized decoder has the potential to improve recognition accuracy but has limited generalization performance, which is a major flaw for BCI applications. To overcome this disadvantage, researchers have begun to use deep learning (DL) to decode WML across individuals [17,18]. DL, with its strong data-fitting ability, can extract abstract features self-adaptively, that is, infer new features from the finite information contained in the training set. In addition, DL can resist noise to a certain degree [19]. Wang et al. proposed a DL method that combined a bi-directional gated recurrent unit (BiGRU) with an attention mechanism and self-monitoring label augmentation (SLA) to classify different levels of WML across participants from fNIRS signals. Their method recognized different WMLs with an accuracy of 77%, outperforming conventional ML [18]. In another study, Asgher et al. applied a recurrent neural network with long short-term memory (LSTM) to decode four types of WMLs from the PFC fNIRS signals of 15 healthy participants, reaching an accuracy of 89.31% [17].

Time series classification (TSC) is a critical and challenging problem in data mining, and DL-based TSC (DL-TSC) has become a research hotspot. DL-TSC has found significant applications in electronic health records [20], human activity recognition [21], network security monitoring [22], and biomedical classification [13]. FNIRS signals have typical time-series characteristics and contain intrinsic, hard-to-interpret, abstract features [23]. DL-TSC can effectively capture hidden patterns in time series: it treats fNIRS signals as a multidimensional time series and inputs them into the model directly without complicated processing. Compared with other DL approaches, DL-TSC works better for end-to-end WML decoding [24]. Thus, we adopted the DL-TSC method to decode WML.

At present, most DL-TSC methods use single-scale one-dimensional convolutional neural networks (CNNs) for feature extraction. However, such networks are limited by their receptive fields, and single-scale CNNs are relatively restricted in extracting temporal features [25]. Additionally, as a typical nonstationary neurophysiological signal, the hemodynamic signal of the cerebral cortex is governed by multiple time scales in the neurophysiological system [26]. If only single-scale one-dimensional CNNs are used for feature extraction, the extracted features may not be comprehensive enough. Conversely, using multi-scale convolution without multi-scale feature fusion sharply increases the number of model parameters and makes the model overly complex. To solve these problems, we proposed a DL-TSC model that integrates a multi-scale CNN with a multi-scale feature attention mechanism and applied it to fNIRS decoding.

FNIRS was used to record the PFC blood oxygen signals during the Sternberg task. After fNIRS data preprocessing, the three levels of WML were classified in two ways. First, statistical features of the fNIRS signals were extracted and selected by LASSO (least absolute shrinkage and selection operator) regression, and WMLs were decoded intra-subject by SVM and LDA based on these features. Second, to develop a method suitable for real-time monitoring of WML in BCI, we proposed a time series classification architecture for WML decoding and compared it with other popular DL-TSC methods to validate its performance.

2. Materials and methods

2.1 Participants and paradigm

Twenty-seven volunteers (age range: 19∼26 years; average age: 22.6 years; 13 women and 14 men) were recruited from Huazhong University of Science and Technology to participate in this study. All participants were healthy, right-handed, and had no history of neurological or psychiatric illnesses in either their personal or family histories. They spoke Chinese as their mother tongue and English as a second language.

The Sternberg task was used with three different levels of WML: WML-4 with 4 letters as the Target, WML-6 with 6 letters as the Target, and WML-6-D with 4 letters as the Target and 2 letters as distractors. Each trial started with a 0.5-s cue, then showed a 1-s Target, and finally a 1.5-s single-letter test after a 3-s delay. Participants were instructed to memorize the Target and compare the test letter with it (Fig. 1). They pressed the left arrow key on the keyboard if the test letter was part of the Target set; otherwise, they pressed the right arrow key. There were 24 trials for each WML. The trials were presented pseudo-randomly, with an interval of 10∼12 seconds between two adjacent trials. All letters were capital English letters, and the distractor letters were shown in green. The experiment included two blocks, each containing 36 trials, with a 30-s rest between blocks.

Fig. 1. The experimental procedure, showing trial examples for each WML.

Before the experiment, each participant received a thorough explanation of the experimental procedures and signed an informed consent form. The experiment was approved by the Human Subjects Institutional Review Board of Huazhong University of Science and Technology.

2.2 FNIRS data acquisition

FNIRS data were collected using a continuous-wave fNIRS system developed by the Britton Chance Center for Biomedical Photonics [27]. The system measured near-infrared light intensity changes at two wavelengths (785 nm and 850 nm) and used the modified Beer–Lambert law to calculate changes in the concentrations of oxyhemoglobin (Δ[HbO2]) and deoxyhemoglobin (Δ[Hb]). The system contained 32 detection channels (16 per hemisphere) and 4 short separation channels (2 on each side), as seen in Fig. 2. The source-detector separation was 3 cm for the standard channels and 1 cm for the short separation channels. The NIRS probes were positioned according to the locations of the F3 and F4 EEG electrodes and covered the primary PFC region, as shown in Fig. 2. The sampling frequency of the fNIRS data was 50 Hz.

Fig. 2. FNIRS system probe distribution. (a) The schematic of fNIRS channel locations on the head. The red dot represents the light source, the yellow square represents the detector of the standard channel, and the green square represents the detector of the short separation channel. (b) The enlarged fNIRS channels. The number is the NIRS channel.

2.3 FNIRS data preprocessing

The raw light intensity collected by the fNIRS device was first converted to the optical density change (ΔOD). We performed a 6-level wavelet decomposition of the fNIRS signal using a one-dimensional discrete wavelet transform with the "db5" mother wavelet. Wavelet coefficients with a probability of less than 0.1 under a fitted Gaussian distribution were considered outliers [28]; these coefficients were set to 0 and the signal was then reconstructed. After the motion artifacts in ΔOD were removed by this wavelet filtering, 0.01∼3.0 Hz bandpass filtering was performed on ΔOD to remove instrument noise. Then, ΔOD was converted to Δ[HbO2] and Δ[Hb] according to the modified Beer–Lambert law; the differential pathlength factors for 785 nm and 850 nm were set to 6.0 and 5.2 [29], respectively. Further, after a 0.01∼1.25 Hz band-pass filter, a third-order polynomial fit was applied to the hemodynamic signals to remove drift [30]. A general linear model (GLM) regression on short separation channels, an effective method to reduce superficial interference [26], was used to suppress superficial interference in the hemodynamic signals. This method treats the signal of a detection channel ($\mathbf{Y}$) as a linear combination of the design matrix $\mathbf{X}_{\mathrm{task}}$ and the signal of the short separation channel ($\mathbf{X}_{\mathrm{short}}$) (Eq. (1)). Channels 35 and 33 were selected as the short separation channels for the left and right hemispheres, respectively. The GLM evaluates how much $\mathbf{X}_{\mathrm{task}}$ and $\mathbf{X}_{\mathrm{short}}$ contribute to the signal at the detection channel, with the regression coefficients solved by ordinary least squares (OLS). The superficial interference is then eliminated by subtracting the contribution of $\mathbf{X}_{\mathrm{short}}$ from $\mathbf{Y}$. Finally, the hemodynamic signals were band-pass filtered at 0.01∼0.5 Hz [31] to reduce the interference of the Mayer wave and then used for subsequent feature extraction and modeling.

$$\mathbf{Y} = [\mathbf{X}_{\mathrm{task}}\;\;\mathbf{X}_{\mathrm{short}}]\begin{bmatrix}\boldsymbol{\beta}_{\mathrm{task}}\\ \boldsymbol{\beta}_{\mathrm{short}}\end{bmatrix} + \boldsymbol{\varepsilon}$$

Here, $\boldsymbol{\beta}_{\mathrm{task}}$ and $\boldsymbol{\beta}_{\mathrm{short}}$ are the activity strengths of $\mathbf{X}_{\mathrm{task}}$ and $\mathbf{X}_{\mathrm{short}}$, respectively, and $\varepsilon$ is the error term. $\mathbf{X}_{\mathrm{task}} = [\mathbf{X}_C\ \mathbf{X}_{\mathrm{HRF}}]$, where $\mathbf{X}_C$ is a constant matrix and $\mathbf{X}_{\mathrm{HRF}}$ is the expected blood oxygen response to the visual working memory (VWM) task, obtained by convolving the hemodynamic response function (HRF) with a box function. The box function is a binary function equal to 1 during stimulus presentation and 0 at other times; the HRF in this study is the SPM canonical HRF [32].
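For concreteness, the following is a minimal sketch of this short-separation regression for a single detection channel in Python with NumPy; the variable names (`y`, `x_hrf`, `x_short`) and the single-regressor design are our illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def remove_superficial(y, x_hrf, x_short):
    """Regress the short-separation signal out of one detection channel.
    y, x_hrf, x_short: 1-D arrays, one value per time point."""
    n = len(y)
    # Design matrix [X_task X_short]: constant column, HRF-model regressor,
    # and the short-separation channel signal
    X = np.column_stack([np.ones(n), x_hrf, x_short])
    # Ordinary least squares estimate of [beta_const, beta_task, beta_short]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Remove the superficial (short-channel) contribution from the signal
    return y - beta[2] * x_short
```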

The main preprocessing steps for the fNIRS data are presented in Fig. 3. The signal-to-noise ratio of Δ[HbO2] is higher than that of Δ[Hb] [33], and a relatively robust signal reduces the noise-processing burden and improves the recognition performance of a BCI system. Therefore, we only analyzed Δ[HbO2] in this study.
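Returning to the motion-artifact step above, the snippet below sketches one plausible implementation of the wavelet filtering with PyWavelets; the two-tailed Gaussian outlier rule is our interpretation of the p < 0.1 criterion from [28]:

```python
import numpy as np
import pywt
from scipy.stats import norm

def wavelet_motion_correction(od, wavelet="db5", level=6, p=0.1):
    """Zero detail coefficients that are Gaussian outliers, then reconstruct.
    od: 1-D optical density (Delta OD) time course."""
    coeffs = pywt.wavedec(od, wavelet, level=level)
    cut = norm.ppf(1 - p / 2)  # |z| beyond this has two-tailed probability < p
    cleaned = [coeffs[0]]      # keep the approximation coefficients
    for d in coeffs[1:]:
        z = np.abs(d - d.mean()) / d.std()
        d = np.where(z > cut, 0.0, d)  # outlier coefficients set to 0
        cleaned.append(d)
    return pywt.waverec(cleaned, wavelet)[: len(od)]
```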

Fig. 3. Data preprocessing process.

2.4 Feature extraction

Given the 1∼2 s delay in the hemodynamic responses to the WM tasks [34], the 9-s data segment beginning 1.5 s after the cue was used for feature extraction. Apart from statistical indexes such as mean, standard deviation, variance, peak, slope, skewness, and kurtosis [35,36], the sample entropy, power, area characteristics, mean square error, and mean of squared SM (SM = Δ[HbO2] − Δ[Hb]) were also used as features. As normalization facilitates model training, all features were normalized according to Eq. (2),

$${X^{\prime}} = \frac{{X - \min (X)}}{{\max (X) - \min (X)}}$$
where X and X’ are original and normalized feature values, respectively.
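As an illustration, the per-channel feature computation and the min-max scaling of Eq. (2) could look like the sketch below (only a subset of the paper's features is shown):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def channel_features(seg):
    """A few statistical features of one channel's 9-s Delta[HbO2] segment."""
    slope = np.polyfit(np.arange(len(seg)), seg, 1)[0]  # linear trend (slope)
    return np.array([seg.mean(), seg.std(), seg.var(), seg.max(),
                     slope, skew(seg), kurtosis(seg)])

def minmax_normalize(F):
    """Eq. (2): scale each feature column to [0, 1] across trials."""
    lo, hi = F.min(axis=0), F.max(axis=0)
    return (F - lo) / (hi - lo)
```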

2.5 Feature selection and ML classification

The feature extraction yielded a 72 × 352 data set for every subject (72 trials × 352 features); LASSO regression was then conducted to remove redundant features and improve recognition accuracy. LASSO shrinks some feature coefficients to 0 by penalizing the regression coefficients, and features with nonzero coefficients after the LASSO contraction process were retained as candidate features [37]. SVM and LDA were trained to recognize the three WMLs based on the candidate features. The SVM parameter C was set to 1 with a linear kernel, and leave-one-out cross-validation (LOOCV) was used to validate the models. To obtain a better-performing model, we traversed all feature subsets of the candidate features in the LOOCV and chose the subset with the highest recognition accuracy as the final classification features. Precision, accuracy, recall, and the Kappa coefficient were used to evaluate the generalization performance of the models.
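A minimal scikit-learn sketch of this pipeline for one subject follows; the LASSO penalty `alpha` is an assumed placeholder (the paper does not report it), treating the class labels as the LASSO regression target is our assumption, and the exhaustive subset search is omitted:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

def lasso_candidates(X, y, alpha=0.01):
    """Indices of features whose LASSO coefficients are nonzero."""
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    return np.flatnonzero(coef)

def loocv_scores(X, y, idx):
    """LOOCV accuracies of a linear SVM (C=1) and LDA on selected features."""
    cv = LeaveOneOut()
    svm_acc = cross_val_score(SVC(C=1, kernel="linear"), X[:, idx], y, cv=cv).mean()
    lda_acc = cross_val_score(LinearDiscriminantAnalysis(), X[:, idx], y, cv=cv).mean()
    return svm_acc, lda_acc
```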

2.6 Data augmentation

To classify different WMLs with DL, data augmentation was applied to increase the data size, i.e., to generate artificial data from real data. Data augmentation produces a great number of new trials that are different from, but related to, the original ones. Being derived from real data, these trials possess a time structure similar to that of real trials [38]. In this work, an analogy-based artificial trial generation was conducted to augment the data [39], which produced artificial data according to the within-class similarity of the data through the following procedures (a code sketch is given below the list):

  • 1. Calculating covariance matrix C of all the available data within a class,
    $$C = \frac{{\sum\limits_{i = 1}^m {{C_i}} }}{m} $$
    where Ci is the covariance matrix of every sample and m is the sample number.
  • 2. Calculating the eigenvector V of covariance matrix C, which is the principal component (PC) of the data [40],
    $$C = VD{V^T}$$
    where D is the diagonal matrix of eigenvalues of C.
  • 3. Selecting three samples randomly: X1, X2, and X3.
  • 4. Projecting the first two samples onto the PCs, i.e., calculating the signal power of X1V and X2V along each Vi (the ith column of V).
  • 5. Calculating the transition matrix Q according to Eq. (5), where diag(P) denotes the diagonal matrix formed from the elements of vector P, and P1 and P2 are the signal powers of X1 and X2 obtained in step 4,
    $$Q = V\,\mathrm{diag}(P_1^{-1/2})\,\mathrm{diag}(P_2^{1/2})V^T$$
  • 6. Calculating the new artificial data Xnew according to Eq. (6).
    $${X^{new}} = {X_3}Q$$

The new artificial data were produced based on the power similarity of two samples within the same class. Because the PCs of the noise vary more than the PCs of the signal, the procedure yields new data with different noise levels.
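The six steps above translate directly into NumPy. The sketch below generates one artificial trial from a list of same-class trials of shape (time × channels), under our assumption that C is the channel covariance:

```python
import numpy as np

def analogy_augment(trials, rng=None):
    """Generate one artificial trial from same-class trials (T x channels)."""
    rng = rng or np.random.default_rng()
    # Step 1: class covariance = average of per-trial channel covariances
    C = np.mean([np.cov(x, rowvar=False) for x in trials], axis=0)
    # Step 2: eigenvectors of C are the principal components (columns of V)
    _, V = np.linalg.eigh(C)
    # Step 3: pick three samples at random
    x1, x2, x3 = (trials[i] for i in rng.choice(len(trials), 3, replace=False))
    # Step 4: signal power of x1 and x2 along each principal component
    p1 = np.mean((x1 @ V) ** 2, axis=0)
    p2 = np.mean((x2 @ V) ** 2, axis=0)
    # Step 5: transition matrix rescales each PC's power from p1 to p2 (Eq. 5)
    Q = V @ np.diag(p1 ** -0.5) @ np.diag(p2 ** 0.5) @ V.T
    # Step 6: the new trial keeps x3's structure with rescaled PC power (Eq. 6)
    return x3 @ Q
```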

2.7 Deep learning classification method

2.7.1 Deep learning time series classification

Definition 1: a time series signal X is expressed as $X = [x_1, x_2, \ldots, x_t, \ldots, x_M]$, where $x_t, t \in [1,M]$, is the tth sample and M is the number of samples in X. $x_t$ may be a one-dimensional or multi-dimensional variable, determined by the number of sensor channels.

Definition 2: a data set of time series D is expressed as $D = \{ (X_1,Y_1),(X_2,Y_2),\ldots, (X_N,Y_N)\}$, where $(X_i,Y_i), i \in [1,N]$, is the ith time series with its one-hot label. When the dataset D includes K classes, the one-hot label $Y_i$ is a vector with K elements: if the class index of $X_i$ is k, element k equals 1 while the others equal 0. For example, with K = 3 classes, a sample of class 2 has the label [0, 1, 0].

TSC aims to train a model on the data set D that maps the time series data space to the corresponding one-hot label space. FNIRS signals are typical multi-channel time series signals. In this work, four DL-TSC methods, the fully convolutional network (FCN), bi-directional long short-term memory (BiLSTM), Resnet, and Inception, were employed to recognize WMLs from fNIRS signals. We introduced a multiscale temporal attention mechanism into Resnet (TAResnet). Moreover, we proposed a TAResnet-BiLSTM. The FCN, Resnet, Inception, and LSTM-Inception used in this paper are consistent with those in the study by Ma et al. on fNIRS time series decoding [19]. The BiLSTM model used in this study consists of three BiLSTM layers, with 90, 90, and 64 hidden neurons, respectively.

2.7.2 TAResnet-BiLSTM

The TAResnet-BiLSTM proposed in this study effectively integrates the multiscale temporal features and global temporal dependencies in fNIRS signals, overcoming the insufficient feature extraction of single-scale convolutional neural networks caused by their limited receptive fields. Moreover, this study used a multiscale feature attention mechanism in the convolutional layers for the weighted fusion of multiscale temporal features, which greatly reduces the number of model parameters and makes the model simpler and easier to train. As shown in Fig. 4, the TAResnet-BiLSTM consists of a Resnet with a multi-scale temporal attention mechanism and one BiLSTM layer. The features extracted by the convolution blocks are concatenated with the features extracted by the BiLSTM layer; the combined features are then summarized by a dense layer and finally classified by a Softmax layer. The input of TAResnet-BiLSTM is a multivariate time series with a dimension of 32. The specific implementation details of TAResnet-BiLSTM are shown in Table 1.

Fig. 4. The network structure of TAResnet-BiLSTM.

Table 1. Hyperparameters of TAResnet-BiLSTM

2.7.2.1. TAResnet:

TAResnet is a residual network consisting of 9 multi-scale convolutional (ATCONV) layers with a multi-scale temporal attention mechanism. Each ATCONV layer consists of three one-dimensional convolutional layers with different kernel sizes ([K1, K2, K3]); the features they extract are weighted and fused through the multi-scale temporal attention mechanism. In TAResnet, a shortcut connection is added across every three consecutive convolutional layers to form a residual block. The ATCONV kernel sizes in the three consecutive residual blocks of TAResnet are [8, 28, 41], [4, 8, 41], and [2, 4, 8]. To achieve multi-level feature fusion in TAResnet, the outputs of the first and second residual blocks are reduced in dimension by a max pooling layer, flattened by a Flatten layer, and passed through a Dense layer; the output of the third residual block is mapped by a Dense layer to a dimension consistent with the level features of the first two residual blocks. Finally, all the level features are concatenated and nonlinearly mapped through a fully connected layer, as shown in Fig. 5. The hyperparameters of TAResnet are shown in Table 2.

Fig. 5. Schematic diagram of TAResnet.

Table 2. Hyperparameters of TAResnet

In the convolutional layers of the residual network, the convolution kernels perform the convolution operation on the multidimensional time series input to form feature maps, and the nonlinear activation function f is then applied to obtain the output maps. Same padding with a convolution stride of one was adopted, so the feature maps keep their original shape. Every convolutional layer computes Eq. (7).

$$v_{i,j}^k = f\left( (w^k \ast x)_{i,j} + b_k \right)$$

Here, 1 ≤ i ≤ WD and 1 ≤ j ≤ HD, where WD and HD are the width and height of the tensor; $w^k$ is the weight matrix and $b_k$ the bias of the kth convolution kernel (k = 1, 2, 3, …, N). The activation function f was the ReLU.

The residual network introduces shortcut connections between convolutional layers, which enable information to flow across layers, avert the attenuation induced by multiple stacked nonlinear transformations, and improve network performance [42].

To fully extract the multiscale temporal features of the fNIRS time series, a temporal attention mechanism was added to every convolutional layer, as presented in Fig. 6. First, three convolution kernels of different scales capture features of the multi-channel fNIRS series in the time dimension, yielding three feature maps F1, F2, and F3. Then, information from the F1, F2, and F3 branches is integrated to execute the attention operation. Specifically, a pooling layer in each branch reduces the dimension of the representation, which enhances efficiency and helps avoid overfitting. Next, the pooled branches are flattened into one-dimensional vectors, which are concatenated and passed successively through a fully connected layer and a Softmax layer to generate the normalized attention weights of the three convolution kernels. Finally, the eventual feature map F is obtained as the attention-weighted sum of the three branches' feature maps, as in Eq. (8),

$$F = a_1 F_1 + b_1 F_2 + c_1 F_3,\quad a_1 + b_1 + c_1 = 1$$
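A minimal TensorFlow/Keras sketch of one ATCONV layer with this attention, plus the residual wrapper, is given below; the filter count, pooling width, and intermediate dense size are illustrative assumptions rather than the published hyperparameters (Table 2):

```python
from tensorflow.keras import layers

def atconv(x, filters, kernel_sizes):
    """Multi-scale 1-D convolution with temporal attention (Fig. 6, Eq. (8))."""
    # Three parallel convolutions capture features at different time scales
    branches = [layers.Conv1D(filters, k, padding="same", activation="relu")(x)
                for k in kernel_sizes]
    # Pool and flatten each branch, then concatenate the summaries
    summary = layers.Concatenate()(
        [layers.Flatten()(layers.MaxPooling1D(pool_size=4)(b)) for b in branches])
    # Dense + softmax yields one normalized attention weight per branch
    attn = layers.Dense(len(branches), activation="softmax")(
        layers.Dense(32, activation="relu")(summary))
    # Weighted fusion of the three feature maps (Eq. (8))
    weighted = [branches[i] * attn[:, i, None, None] for i in range(len(branches))]
    return layers.Add()(weighted)

def residual_block(x, filters, kernel_sizes):
    """Three consecutive ATCONV layers with a shortcut connection."""
    y = x
    for _ in range(3):
        y = atconv(y, filters, kernel_sizes)
    shortcut = layers.Conv1D(filters, 1, padding="same")(x)  # match channels
    return layers.Activation("relu")(layers.Add()([y, shortcut]))
```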

Fig. 6. Schematic diagram of the multiscale temporal attention mechanism.

2.7.2.2 BiLSTM:

BiLSTM, an extended form of LSTM, feeds the time series into an LSTM model forwardly and then reversely, and the two results are combined to produce the BiLSTM output. This promotes the learning of long-term dependencies and enhances model performance [43]. Each LSTM unit comprises three gated units, namely the forget gate ($f_t$), input gate ($i_t$), and output gate ($o_t$) [43]. The forget gate decided what information from the previous moment was retained in the cell state at the current moment, as shown in Eq. (9),

$${f_t} = \sigma ({W_f}[{h_{t - 1}},{x_t}] + {b_f})$$
where σ is the sigmoid activation function, $W_f$ is the weight matrix, $h_{t-1}$ is the output of the previous time step, and $b_f$ is the bias of the forget gate. The input gate decided how much new information was retained in the present unit during updating, as shown in Eq. (10),
$${i_t} = \sigma ({W_i}[{h_{t - 1}},{x_t}] + {b_i})$$
where $W_i$ is the weight matrix, $x_t$ is the present input, and $b_i$ is the bias of the input gate.

The forget gate could save useful information from the previous moment, and the input gate could prevent irrelevant information from being kept. Thus, the current LSTM cell state equaled the product of the forget gate output and the previous cell state plus the product of the input gate output and the candidate state ($\tilde{c}_t$), as shown in Eq. (11) and Eq. (12),

$${\tilde{c}_t} = \tanh ({W_{\tilde{c}}}[{h_{t - 1}},{x_t}] + {b_c})$$
$${c_t} = {f_t} \odot {c_{t - 1}} + {i_t} \odot {\tilde{c}_t}$$
where $W_{\tilde{c}}$ is the weight matrix, $\tilde{c}_t$ is the vector of new candidate values to be added to the LSTM memory cell, $\odot$ denotes elementwise multiplication, and $c_t$ is the LSTM cell state.

The output gate decided what information of the present cell state was retained in the output $h_t$, according to Eq. (13) and Eq. (14),

$${o_t} = \sigma ({W_o}[{h_{t - 1}},{x_t}] + {b_o})$$
$${h_t} = {o_t} \odot \tanh ({c_t})$$
where $W_o$ is the weight matrix and $b_o$ is the bias of the output gate.
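Combining the two branches as described in Fig. 4, a sketch of the overall model in Keras could look as follows; it reuses the `residual_block` helper from the earlier sketch, omits the multi-level feature fusion for brevity, and the 64-unit layer sizes are our assumptions:

```python
from tensorflow.keras import layers, models

def build_taresnet_bilstm(n_steps, n_channels=32, n_classes=3):
    """Two-branch TAResnet-BiLSTM sketch (convolutional + recurrent)."""
    inp = layers.Input(shape=(n_steps, n_channels))
    # Convolutional branch: three residual blocks with shrinking kernel scales
    y = residual_block(inp, 64, (8, 28, 41))
    y = residual_block(y, 64, (4, 8, 41))
    y = residual_block(y, 64, (2, 4, 8))
    conv_feat = layers.Dense(64, activation="relu")(layers.Flatten()(y))
    # Recurrent branch: one BiLSTM layer capturing global temporal dependencies
    rnn_feat = layers.Bidirectional(layers.LSTM(64))(inp)
    # Concatenate both branches, summarize, and classify with softmax
    z = layers.Concatenate()([conv_feat, rnn_feat])
    z = layers.Dense(64, activation="relu")(z)
    out = layers.Dense(n_classes, activation="softmax")(z)
    return models.Model(inp, out)
```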

2.7.3 Loss function and model implementation

In the TAResnet-BiLSTM network, the categorical cross-entropy was used as the loss function. Meanwhile, L1 regularization was added to the fully connected layer to avoid overfitting and suppress the influence of irrelevant features. Thus, the loss function consisted of two parts, as shown in Eq. (15),

$$\mathrm{loss} = -\frac{1}{N}\sum_{k = 1}^{K}\sum_{i = 1}^{N} y_i^k \log \hat{y}_i^k + \lambda \sum_{j = 1}^{M} |w_j|$$
where $\hat{y}_i^k$ is the output of TAResnet-BiLSTM for the ith sample, $y_i^k$ is the true label of the ith sample, λ is the L1 regularization parameter, N is the number of training samples, K is the number of classes, and $w_j$ are the M weights of the fully connected layer.
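In Keras terms, Eq. (15) corresponds to using categorical cross-entropy as the compile-time loss and attaching an L1 kernel regularizer to the fully connected layer; the λ value of 1e-4 below is an assumed placeholder:

```python
from tensorflow.keras import layers, regularizers

# L1 penalty on the dense layer's weights; Keras adds lambda * sum(|w|)
# to the training loss automatically.
dense = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l1(1e-4))
# The cross-entropy term is supplied at compile time:
# model.compile(loss="categorical_crossentropy", ...)
```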

There were only 27 subjects in this study, and each subject had 72 samples; this data volume could not meet the needs of DL. To overcome this problem, we first divided each subject's samples into a training set and a testing set at a ratio of 8:2, performed data augmentation on each subject's training set to increase the data by 250%, and combined each subject's original and augmented training data to form the overall training set. We then combined the testing sets of the 27 participants and divided them into a validation set and a testing set at 5:5. The numbers of samples in the training, validation, and test sets used for modeling were 5443, 681, and 680, respectively.

Data augmentation within individuals can reduce data bias [44,45] and keep the artificial data close to the real data. In the model training process, stochastic gradient descent (SGD) was used as the optimizer with a learning rate of 0.001. To improve SGD's optimization ability and speed up model training, momentum was added to SGD, and the weight decay was set to 10−5. To overcome overfitting, the data were randomly shuffled and L1 regularization was used in the fully connected layer. In addition, an early stopping mechanism monitored the validation loss; if the validation loss did not improve within 15 epochs, training was stopped immediately.
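A sketch of this training setup in Keras is shown below; the momentum value (0.9) and epoch budget are assumptions since the paper states only that momentum was added, `weight_decay` requires TensorFlow ≥ 2.11, and `x_train`/`y_train` etc. stand for the augmented splits described above:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9,
                                    weight_decay=1e-5)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=15)

model = build_taresnet_bilstm(n_steps=90)  # 9-s window at 50 Hz, averaged by 5
model.compile(optimizer=optimizer, loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          shuffle=True, epochs=200, callbacks=[early_stop])
```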

3. Results

3.1 Behavioral results

The subjects' response time (key press time), accuracy, miss trials (trials without a key press), and error rate for each WML are shown in Fig. 7. Analyses of variance (ANOVA) with WML as the factor showed that the WML effect was significant for all four behavioral parameters (p < 0.05). The response times of WML-6-D (1.04 s ± 0.12 s), WML-6 (0.98 s ± 0.11 s), and WML-4 (0.96 s ± 0.1 s) decreased gradually: WML-6-D > WML-6 > WML-4 (p < 0.05). The accuracies of WML-6-D (90.5% ± 7.4%), WML-6 (94.9% ± 4.67%), and WML-4 (97.5% ± 5.0%) increased gradually: WML-6-D < WML-6 < WML-4 (p < 0.05). For miss trials, WML-6-D (1.40 ± 1.61) was higher than WML-6 (0.52 ± 0.92) and WML-4 (0.20 ± 0.50) (p < 0.05), but there was no significant difference between WML-6 and WML-4. The error rates of WML-6-D (9.5% ± 7.4%), WML-6 (5.1% ± 4.67%), and WML-4 (2.5% ± 5.0%) decreased gradually: WML-6-D > WML-6 > WML-4 (p < 0.05).

Fig. 7. Behavioral results for each WML. (a) Response time; (b) Accuracy; (c) Miss trials; (d) Error rate. The data are expressed as mean ± standard error (SE); * represents p < 0.05, ** represents p < 0.01, and *** represents p < 0.0001.

3.2 Comparison of blood oxygen response

In this study, we used an analogy-based data augmentation method to generate artificial data. Figure 8(a) shows the waveforms of the real blood oxygen data, averaged across trials, of a participant's left-hemisphere channel 17 (ch17) under the three VWM tasks, and Fig. 8(b) shows the averaged waveforms of the artificial blood oxygen data of channel 1 (ch1) in the right hemisphere. The gray highlighted rectangles in Fig. 8 indicate the stimulus periods. To compare real with artificial blood oxygen data, we averaged the real data of 24 trials under each VWM task and the artificial blood oxygen data of 48 trials. Both real and artificial blood oxygen data exhibit significant hemodynamic responses during the stimulus periods; they not only have a high degree of similarity in temporal form but also exhibit similar activation characteristics. For example, in the VWM tasks, the activation of the right-hemisphere channel (ch1) is higher than that of the left-hemisphere channel (ch17). The artificial blood oxygen data exhibit greater fluctuations than the real data, owing to the differences between trials of the same VWM task and the noise introduced from the original blood oxygen data.

Fig. 8. Average waveform analysis. (a) Hemodynamic responses of channel 1 in three VWM tasks. (b) Hemodynamic responses of channel 27 in three VWM tasks.

3.3 ML classification results

The classification performances of SVM and LDA in recognizing the three levels of WML are shown in Fig. 9. LDA presented higher generalization performance than SVM: its classification accuracy, recall, precision, and kappa coefficient were 94.6%, 87.4%, 87.7%, and 0.82, respectively, compared with 79.1%, 68.3%, 67.6%, and 0.51 for SVM.

Fig. 9. Evaluation of the intra-subject classification performance of SVM and LDA.

3.4 DL-TSC results

To decrease the number of model parameters and reduce model complexity, we downsampled the data by averaging every five data points. In our proposed TAResnet-BiLSTM model, the multi-scale temporal attention mechanism was introduced into the residual neural network, and BiLSTM and TAResnet were combined for WML decoding. The performance of TAResnet-BiLSTM was compared with other TSC models, including FCN, BiLSTM, Resnet, Inception, and LSTM-Inception (Table 3).
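For reference, this block-averaging downsampling can be written in one line of NumPy, assuming the number of time points is a multiple of five:

```python
import numpy as np

def downsample(x, factor=5):
    """Average every `factor` consecutive time points; x: (time, channels)."""
    return x.reshape(-1, factor, x.shape[1]).mean(axis=1)
```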

Table 3. Comparison results of different classification methods.

The accuracies of all the DL-TSC approaches exceeded 85%. TAResnet, which incorporates the attention mechanism, performed better than the original Resnet, with 3.9% higher classification accuracy. TAResnet also slightly outperformed BiLSTM and FCN but was inferior to Inception and LSTM-Inception. In addition, the classification accuracy of TAResnet-BiLSTM reached 92.4%; its confusion matrix is shown in Fig. 10. These results testify to the benefit of multiscale convolution kernels in extracting feature information and raising decoding accuracy.

Fig. 10. The confusion matrix of TAResnet-BiLSTM.

4. Discussion

This study collected PFC hemodynamic signals from subjects during the Sternberg task and then decoded WML using ML and DL algorithms. Based on the selected features, traditional LDA decoded the three levels of WML efficiently and performed better than SVM, with a classification accuracy of 94.6%. Moreover, we applied the DL-TSC approach to decode WML after data augmentation. We proposed a time series decoding architecture, TAResnet-BiLSTM, which combines a multi-scale residual neural network with BiLSTM and achieved an accuracy of 92.4% in inter-subject WML decoding.

The present study exploited a Sternberg paradigm close to a real context, in which different WMLs appear pseudo-randomly with distractors. The behavioral responses of the subjects verified the three different levels of WML, with higher WML eliciting poorer behavioral performance and vice versa. Among traditional ML algorithms, LDA suited intra-subject WML decoding better than SVM, with shorter training time and better recognition performance; its classification accuracy on the three WMLs reached 94.6%. However, even with such high efficiency, traditional ML decoding depends heavily on feature extraction and selection, and the presence of individual differences limits the ML application in BCI.

In the FCN, three consecutive convolutional layers are used for temporal feature extraction, and its network structure cannot extract multi-scale features. The Inception network extracts multi-scale temporal features through six consecutive Inception blocks and adds a residual connection across every three consecutive blocks; stacking Inception blocks extracts multi-scale features at multiple levels, while the residual connections help the model converge faster and more easily. The proposed TAResnet-BiLSTM uses multi-scale convolution, adopts a multi-scale feature attention mechanism for feature fusion, and integrates hierarchical features within the network. The decoding accuracies of TAResnet and Inception were 89.7% and 90%, respectively, higher than those of FCN, BiLSTM, and Resnet. This is mainly because both TAResnet and Inception take into account the impact of features at various time scales on classification and thus extract more comprehensive temporal features [46,47]. Physiologic systems are regulated by interacting mechanisms that operate across multiple temporal and spatial scales [48]. As a typical time-series physiological signal, fNIRS buries both noise and latent dynamic information in complex fluctuations, and extracting features with a single-scale convolution kernel may lose some hidden temporal characteristics, as testified by the higher decoding accuracy of TAResnet and Inception compared with FCN and Resnet. It is worth emphasizing that our proposed TAResnet-BiLSTM TSC architecture achieved an accuracy of 92.4% in cross-individual decoding of the three levels of WML, which shows that combining BiLSTM with convolutional neural networks substantially improves WML decoding accuracy.

Neural networks work favorably in feature mining and recognition but require vast amounts of data for training [17]. With a small sample size, the model is prone to overfitting [49], resulting in failure of model training. Our research reduced the number of hidden units, stochastically dropped out some neurons during training, and introduced an early stopping mechanism to monitor the training process [50], which largely avoided overfitting. In addition, we augmented the data by 250%. After data augmentation, our TAResnet-BiLSTM obtained favorable performance and showed no overfitting during training. These tactics efficiently overcome the limitation of the small sample size imposed by experimental conditions and significantly enhance BCI classification accuracy [36].

The most important aspect of real-time BCI applications is the performance of the decoding algorithm. The TAResnet-BiLSTM proposed in this study achieved high accuracy in offline cross-individual WML decoding with a relatively small data volume. In addition, running our model does not require high computational resources, which makes it possible to deploy it in real-time application environments. To achieve cross-subject WML evaluation in real-time environments, the trained model parameters can be transferred to new subjects for WML decoding. However, although the proposed TAResnet-BiLSTM can fully extract the temporal features of fNIRS signals, it has no network structure designed for spatial feature extraction. As a multi-channel time series, fNIRS signals contain complex spatial features that are beneficial for improving the accuracy of fNIRS decoding [51,52]. This is an important direction for further optimization of the model in the future.

At present, the effectiveness of near-infrared devices in detecting working memory load has been verified [18,53], and with the development of fNIRS technology, portable and wearable fNIRS devices have begun to be used for working memory assessment [54]. Such devices enable real-time assessment of users' working memory load in the real world and can help adjust work allocation, assisting users in maintaining their best working state and preventing decision-making errors due to working memory overload. Furthermore, fNIRS, as a promising optical neuroimaging technology, has a wide range of uses beyond working memory load measurement, such as clinical diagnosis [55], rehabilitation [56], and emotional monitoring [57]. The continued development of fNIRS technology will promote real-time WML classification applications.

5. Conclusion

In this study, the WMLs were decoded with ML and DL. As far as we know, this study is the first to apply the DL-TSC method to WML decoding. Our results are encouraging and could provide a basis for the brain-computer interface application of fNIRS in real-time WML detection. However, fNIRS detects hemodynamic signals, which lag neural activity by an inherent delay of 2∼3 s [58]. To achieve a faster and more effective WML detection system, a hybrid fNIRS-EEG system can be used to monitor brain neural activity [45,59]. In the future, we need to further verify and improve our model in real-time environments.

Funding

National Natural Science Foundation of China (32000980, 82171533); Basic and Applied Basic Research Foundation of Guangdong Province (2022A1515140142, 2020B1515120014); Key Laboratory Program of Guangdong Higher Education Institutes (2020KSYS001).

Acknowledgments

We would like to thank all subjects for their participation.

Disclosures

No conflicts of interest are declared.

Data availability

Data are available on request due to privacy/ethical restrictions.

References

1. G. Schalk, D.J. McFarland, T. Hinterberger, et al., “BCI2000: a general-purpose, brain-computer interface (BCI) system,” IEEE Trans. Biomed. Eng. 51(6), 1034–1043 (2004). [CrossRef]  

2. N.F. Ramsey and M.P. Van De Heuvel, “Towards human BCI applications based on cognitive brain systems: an investigation of neural signals recorded from the dorsolateral prefrontal cortex,” IEEE Trans. Neural Syst. Rehabil. Eng. 14(2), 214–217 (2006). [CrossRef]  

3. Z. He, Z. Li, F. Yang, et al., “Advances in multimodal emotion recognition based on brain-computer interfaces,” Brain. Sci. 10(10), 687 (2020). [CrossRef]  

4. E. P. Torres, E. A. Torres, M. Hernández-Álvarez, et al., “EEG-based BCI emotion recognition: a survey,” Sensors 20(18), 5083 (2020). [CrossRef]  

5. P. J. Lin, T. Jia, C. Li, et al., “CNN-based prognosis of BCI rehabilitation using EEG from first session BCI Training,” IEEE Trans. Neural Syst. Rehabil. Eng. 29, 1936–1943 (2021). [CrossRef]  

6. R. Mane, T. Chouhan, and C. Guan, “BCI for stroke rehabilitation: motor and beyond,” J. Neural. Eng. 17(4), 041001 (2020). [CrossRef]  

7. F. Pichiorri, N. Mrachacz-Kersting, M. Molinari, et al., “Brain-computer interface based motor and cognitive rehabilitation after stroke – state of the art, opportunity, and barriers: summary of the BCI Meeting 2016 in Asilomar,” Brain-Computer Interfaces 4(1-2), 53–59 (2017). [CrossRef]  

8. N. Cowan, “The magical mystery four: How is working memory capacity limited, and why?” Curr. Dir. Psychol. Sci. 19(1), 51–57 (2010). [CrossRef]  

9. P. C. Fletcher, T. Shallice, and R. J. Dolan, “The functional roles of prefrontal cortex in episodic memory. II. Retrieval,” Brain: a journal of neurology 121(7), 1249–1256 (1998). [CrossRef]  

10. M. Petrides, “Frontal lobes and memory,” Handbook of neuropsychology 3(2), 75–90 (1989). [CrossRef]  

11. B. Milner and M. Petrides, “Behavioural effects of frontal-lobe lesions in man,” Trends Neurosci. 7(11), 403–407 (1984). [CrossRef]  

12. S. Sternberg, “High-speed scanning in human memory,” Science 153(3736), 652–654 (1966). [CrossRef]  

13. H. Aghajani, M. Garbey, and A. Omurtag, “Measuring mental workload with EEG+ fNIRS,” Front. Hum. Neurosci. 11, 359 (2017). [CrossRef]  

14. N. Naseer and K. S. Hong, “fNIRS-based brain-computer interfaces: a review,” Front. Hum. Neurosci. 9, 3 (2015). [CrossRef]  

15. S. Dong and J. Jeong, “Onset classification in hemodynamic signals measured during three working memory tasks using wireless functional near-infrared spectroscopy,” IEEE J. Select. Topics Quantum Electron. 25(1), 1–11 (2018). [CrossRef]  

16. A. De, “Prefrontal haemodynamics based classification of inter-individual working memory difference,” Electron. Lett. 56(25), 1406–1408 (2020). [CrossRef]  

17. U. Asgher, K. Khalil, M. J. Khan, et al., “Enhanced accuracy for multiclass mental workload detection using long short-term memory for brain–computer interface,” Front. Neurosci. 14, 584 (2020). [CrossRef]  

18. J. Wang, T. Grant, S. Velipasalar, et al., “Taking a deeper look at the brain: predicting visual perceptual and working memory load from high-density fNIRS data,” IEEE J. Biomed. Health Inform. 26(5), 2308–2319 (2021). [CrossRef]  

19. T. Ma, S. Wang, Y. Xia, et al., “CNN-based classification of fNIRS signals in motor imagery BCI system,” J. Neural Eng. 18(5), 056019 (2021). [CrossRef]  

20. A. Rajkomar, E. Oren, K. Chen, et al., “Scalable and accurate deep learning with electronic health records,” NPJ Digit Med. 1(1), 1–10 (2018). [CrossRef]  

21. H. F. Nweke, Y. W. Teh, M. A. Al-Garadi, et al., “Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: state of the art and research challenges,” Expert. Syst. Appl. 105, 233–261 (2018). [CrossRef]  

22. G.A. Susto, A. Cenedese, and M. Terzi, “Time-series classification methods: review and applications to power systems data,” Big data application in power system 2018, 179–220 (2018). [CrossRef]  

23. C. L. Liu, W. H. Hsaio, Y. C. Tu, et al., “Time series classification with multivariate convolutional neural network,” IEEE Trans. Ind. Electron. 66(6), 4788–4797 (2018). [CrossRef]  

24. H. Ismail Fawaz and G. Forestier, “Deep learning for time series classification: a review,” Data Min Knowl Disc 33(4), 917–963 (2019). [CrossRef]  

25. Z. Cui, W. Chen, Y. Chen, et al., “Multi-scale convolutional neural networks for time series classification,” arXiv, arXiv:1603.06995 (2016). [CrossRef]  

26. H. Y. Lu, E. S. Lorenc, H. Zhu, et al., “Multi-scale neural decoding and analysis,” J. Neural Eng. 18(4), 045013 (2021). [CrossRef]  

27. Z. Zhang, B. Sun, H. Gong, et al., “A fast neuronal signal-sensitive continuous-wave near-infrared imaging system,” Rev. Sci. Instrum. 83(9), 094301 (2012). [CrossRef]  

28. S. Brigadoi, L. Ceccherini, S. Cutini, et al., “Motion artifacts in functional near-infrared spectroscopy: a comparison of motion correction techniques applied to real cognitive data,” NeuroImage 85, 181–191 (2014). [CrossRef]  

29. S. L. Engerman and K. L. Sokoloff, “Factor endowments: institutions, and differential paths of growth among new world economies: a view from economic historians of the United States,” NeuroImage 85, 181 (1994). [CrossRef]  

30. L. Gagnon, M. A. Yücel, D. A. Boas, et al., “Further improvement in reducing superficial contamination in NIRS using double short separation measurements,” NeuroImage 85, 127–135 (2014). [CrossRef]  

31. M. A. Yücel, A. Lühmann, F. Scholkmann, et al., “Best practices for fNIRS publications,” Neurophotonics 8(1), 012101 (2021). [CrossRef]  

32. V. Della-Maggiore, W. Chau, P. R. Peres-Neto, et al., “An empirical comparison of SPM preprocessing parameters to the analysis of fMRI data,” NeuroImage 17(1), 19–28 (2002). [CrossRef]  

33. G. Strangman, J. P. Culver, J. H. Thompson, et al., “A quantitative comparison of simultaneous BOLD fMRI and NIRS recordings during functional brain activation,” NeuroImage 17(2), 719–731 (2002). [CrossRef]  

34. R. McKendrick, H. Ayaz, R. Olmstead, et al., “Enhancing dual-task performance with verbal and spatial working memory training: continuous monitoring of cerebral hemodynamics with NIRS,” NeuroImage 85, 1014–1026 (2014). [CrossRef]  

35. N. Naseer, M. J. Hong, and K.S. Hong, “Online binary decision decoding using functional near-infrared spectroscopy for the development of brain–computer interface,” Exp. Brain. Res. 232(2), 555–564 (2014). [CrossRef]  

36. N. Naseer and K. S. Hong, “Classification of functional near-infrared spectroscopy signals corresponding to the right-and left-wrist motor imagery for development of a brain–computer interface,” Neurosci. Lett. 553, 84–89 (2013). [CrossRef]  

37. S. D. Wickramaratne and M.S. Mahmud, “Conditional-GAN based data augmentation for deep learning task classifier improvement using fNIRS data,” Front. Big. Data. 4, 1 (2021). [CrossRef]  

38. F. Lotte, “Signal processing approaches to minimize or suppress calibration time in oscillatory activity-based brain–computer interfaces,” Proc. IEEE 103(6), 871–890 (2015). [CrossRef]  

39. H. Wang, L. Xu, A. Bezerianos, et al., “Linking attention-based multiscale CNN with dynamical GCN for driving fatigue detection,” IEEE Trans. Instrum. Meas. 70, 1 (2020). [CrossRef]  

40. L. I. Smith, “A tutorial on principal components analysis,” arXiv, arXiv:1404.1100 (2002). [CrossRef]  

41. M. J. Khan and K. S. Hong, “Passive BCI based on drowsiness detection: an fNIRS study,” Biomed. Opt. Express 6(10), 4063–4078 (2015). [CrossRef]  

42. R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” Adv. Neural Inf Process Syst 28, 1 (2015).

43. A. G. Felix, S. Jürgen, and C. J. Fred, “Learning to forget: Continual prediction with LSTM,” Neural. Comput. 12(10), 2451–2471 (2000). [CrossRef]  

44. T. Nagasawa, T. Sato, I. Nambu, et al., “fNIRS-GANs: data augmentation using generative adversarial networks for classifying motor tasks from functional near-infrared spectroscopy,” J. Neural. Eng. 17(1), 016068 (2020). [CrossRef]  

45. Y. Gao, H. Liu, F. Fang, et al., “Classification of working memory loads via assessing broken detailed balance of EEG-fNIRS neurovascular coupling measures,” IEEE Trans. Biomed. Eng. 70(3), 877–887 (2022). [CrossRef]  

46. A. M. Roy, “Adaptive transfer learning-based multiscale feature fused deep convolutional neural network for EEG MI multiclassification in brain–computer interface,” Eng. App. Artif. Intel. 116, 105347 (2022). [CrossRef]  

47. A. M. Roy, “An efficient multi-scale CNN model with intrinsic feature integration for motor imagery EEG subject classification in brain-machine interfaces,” Biomed. Signal Process Control. 74, 103496 (2022). [CrossRef]  

48. M. Costa, A. L. Goldberger, and C. K. Peng, “Multiscale entropy analysis of biological signals,” Phys. Rev. E: Stat., Nonlinear, Soft Matter Phys. 71(2), 021906 (2005). [CrossRef]  

49. M. M. Bejani and M. Ghatee, “A systematic review on overfitting control in shallow and deep neural networks,” Artif. Intell. Rev. 54(8), 6391–6438 (2021). [CrossRef]  

50. R. Roelofs, V. Shankar, B. Recht, et al., “A meta-analysis of overfitting in machine learning,” Adv. Neural. Inf. Process Syst. 32, 1 (2019). [CrossRef]  

51. X. Liu, Y. Shen, J. Liu, et al., “Parallel spatial–temporal self-attention CNN-based motor imagery classification for BCI,” Front. Neurosci. 14, 587520 (2020). [CrossRef]  

52. Y. Zhang, D. Liu, T. Li, et al., “CGAN-rIRN: a data-augmented deep learning approach to accurate classification of mental tasks for a fNIRS-based brain-computer interface,” Biomed. Opt. Express 14(6), 2934–2954 (2023). [CrossRef]  

53. R. Karthikeyan, J. Carrizales, C. Johnson, et al., “A window into the tired brain: neurophysiological dynamics of visuospatial working memory under fatigue,” Hum. Factors 66(2), 528–543 (2024). [CrossRef]  

54. M. J. Saikia, “K-means clustering machine learning approach reveals groups of homogeneous individuals with unique brain activation, task, and performance dynamics using fNIRS,” IEEE Trans. Neural Syst. Rehabil. Eng. 31, 2535–2544 (2023). [CrossRef]  

55. Y. Y. Wei, Q. Chen, A. Curtin, et al., “Functional near-infrared spectroscopy (fNIRS) as a tool to assist the diagnosis of major psychiatric disorders in a Chinese population,” Eur. Arch. Psychiatry Clin. Neurosci. 271(4), 745–757 (2021). [CrossRef]  

56. M. Kim, S. Jang, D. Lee, et al., “A comprehensive research setup for monitoring Alzheimer’s disease using EEG, fNIRS, and gait analysis,” Biomed. Eng. Lett. 14(1), 13–21 (2024). [CrossRef]  

57. Y. Zhu, J. K. Jayagopal, R. K. Mehta, et al., “Classifying major depressive disorder using fNIRS during motor rehabilitation,” IEEE Trans. Neural Syst. Rehabil. Eng. 28(4), 961–969 (2020). [CrossRef]  

58. T. Gateau, G. Durantin, F. Lancelot, et al., “Real-time state estimation in a flight simulator using fNIRS,” PLoS One 10(3), e0121279 (2015). [CrossRef]  

59. F. Putze, S. Hesslinger, C. Y. Tse, et al., “Hybrid fNIRS-EEG based classification of auditory and visual perception processes,” Front. Neurosci. 8, 373 (2014). [CrossRef]  


