Efficient human–robot collaboration during physical interaction requires estimating the human state for optimal role allocation and load sharing. Machine learning (ML) methods are gaining popularity for estimating the interaction parameters from physiological signals. However, due to individual differences, the ML models might not generalize well to new subjects. In this study, we present a convolution neural network (CNN) model to predict motor control difficulty using surface electromyography (sEMG) from the human upper limb during a physical human–robot interaction (pHRI) task, together with a transfer learning approach that adapts a learned model to new subjects. Twenty-six individuals participated in a pHRI experiment where a subject guides the robot's end-effector with different levels of motor control difficulty. The motor control difficulty is varied by changing the damping parameter of the robot from low to high and constraining the motion to gross and fine movements. A CNN with raw sEMG as input is used to classify the motor control difficulty. The CNN's transfer learning approach is compared against Riemann geometry-based Procrustes analysis (RPA). With very few labeled samples from new subjects, we demonstrate that the CNN-based transfer learning approach (avg. 69.77%) outperforms the RPA transfer learning (avg. 59.20%). Moreover, we observe that the skill level of the subjects used to pre-train the model has no significant effect on the transfer learning performance for new users.
Physical human–robot interaction (pHRI) is becoming a crucial part of many industrial applications such as assembly, welding, and painting, where dexterous human capabilities can be leveraged along with the precision of industrial robots. Recent advances in machine learning are enabling traditional robots to adapt to human operators by considering the operator's intention and cognitive and physical state to ensure safe and efficient collaboration. For such adaptive decision algorithms, physiological data such as electromyography, eye-tracking, and electroencephalography (EEG) form indispensable modalities [3,4]. However, learning a generalized decision algorithm using physiological signals is still challenging because of individual differences and temporal changes in physiological signals even within the same subject. In such scenarios, transfer learning approaches are a viable choice [5,6]. This work explores such a transfer learning approach to adapt a learned motor difficulty classification model to new subjects in a physical human–robot interaction experiment.
Robotic compliance and human sensory feedback are the key aspects of pHRI necessary to adapt the robotic system to new users and tasks and also reduce the chances of physical fatigue induced due to repetitive movements . Some popular control strategies for compliance control are impedance and admittance control [8,9]. These control strategies enable a stiff actuator equipped with position or force sensors to exhibit compliant behavior by rendering virtual dynamics. In such controllers, the robot compliance is rendered using the virtual inertia, damping, or stiffness parameters. One can achieve the desired behavior by tuning these parameters appropriately.
Most importantly, the desired values of these virtual parameters depend on the task type and the contact dynamics between the human and robot [10,11]. For example, increasing the virtual damping improves movement accuracy but requires more human effort, whereas decreasing the damping facilitates low-effort collaboration but degrades fine movement (FM) accuracy. In addition to the task type, contact dynamics play a major role in the stability of the interaction. For instance, the robot can become unstable while interacting with a stiff environment or when the operator increases the grasp pressure [12,13]. Therefore, an adaptive control strategy is required to adjust to different users or tasks [11,14].
Physiological signals such as surface electromyography (sEMG) and EEG are among the primary modalities considered in adaptive robotic control strategies [10,15,16]. Among these signals, sEMG can help extract contact dynamics information because it is more directly related to human limb stiffness [10,12]. However, using physiological signals such as sEMG to adjust robot control parameters is not straightforward because of their low signal-to-noise ratio and task dependence. Physiological information is also subject dependent, and even data recorded from the same subject at a different time under identical experimental conditions exhibit non-negligible differences. This inter- and intra-subject variability makes it difficult for classification algorithms to learn features that generalize well across different subjects.
Recently, deep learning approaches have attracted considerable interest owing to the large amount of data available and their success in movement detection, gesture classification, and intent detection [20–23]. For instance, a recent review by Faust et al. reports the superiority of convolution neural networks (CNNs) over conventional machine learning algorithms in gesture classification using sEMG data. However, even deep learning approaches fail to generalize across subjects due to individual differences. To overcome this issue, researchers use transfer learning techniques in which a model trained on a specific domain can be adapted to a new domain by retraining only a few parameters of the network [5,24–27].
This paper explores a deep transfer learning approach to adapt a learned classification model to new subjects. For this purpose, we designed a physical human–robot interaction experiment in which the users perform fine and gross movements (GMs) by guiding an admittance-controlled robot. The damping parameter of the admittance controller is varied between predefined high and low values. Thus, the experiment has two factors with two levels, high/low for damping and fine/gross for task type. During the interaction, we collect the participant's sEMG from the forearm and use it for offline analysis to characterize the interaction into three categories. Based on the predicted category, the robotic system can choose to increase, decrease, or maintain the same level of virtual damping to ensure fluid interaction between the human and the robot. We compare two approaches: a feature-based approach and a deep CNN approach. The CNN architecture consists of two linear transformations and three convolution operations with a log-softmax output and no fully connected layers, inspired by Schirrmeister et al.  and Passalis et al. . A total of 26 subjects participated in the experiment. We trained a base classifier on 10 subjects and performed inter-subject transfer learning, using only 10% of the new subject's data. We compare the performance of CNN-based transfer learning approach with Riemann geometry-based Procrustes analysis (RPA).
2 Experimental Setup
We used a six-degree-of-freedom robotic system (Schunk Powerball LWA4P) with a six-axis force/torque sensor (Weiss KMS40) as shown in Fig. 1. A handle is attached to the robot's end-effector, and its position is mapped onto a virtual environment created using the CoppeliaSim simulator. Visual feedback is provided to the human using a screen placed in front of the human. During the experiment, sEMG data from an individual's forearm are recorded using a Myo armband (Thalmic labs) consisting of eight electrodes.
A total of four tasks are designed with varying levels of effort and task difficulty, i.e., low or high levels of damping and gross or fine movements for the task type as shown in Fig. 2. The selection of four tasks with two factors results in a full factorial design, and the participants perform these four tasks in a randomized order.
An admittance control strategy converts the human applied force (measured using the force/torque sensor) into the respective Cartesian space velocities by simulating virtual dynamics at the end-effector. Two virtual parameters, mass and damping, govern the dynamics of the end-effector. Here, virtual mass is a diagonal matrix Diag([3, 3, 3, 0.1, 0.1, 0.1]) and is kept constant throughout all four tasks, while the virtual damping is switched between two levels: high damping (HD, Diag([80, 80, 80, 10, 10, 10])) and low damping (LD, Diag([20, 20, 20, 6, 6, 6])). More detailed information on the admittance control implementation can be found in Ref. .
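The admittance law described above can be illustrated with a minimal sketch. It assumes the common form M·dv/dt + D·v = F with the diagonal mass and damping values quoted in the text, integrated with an explicit Euler step. The integration scheme, the 1 ms period `DT`, and the function names are illustrative assumptions, not details taken from the referenced implementation.

```python
# Diagonal virtual mass and damping values quoted in the text
# (three translational axes followed by three rotational axes).
MASS = [3.0, 3.0, 3.0, 0.1, 0.1, 0.1]
DAMPING_HD = [80.0, 80.0, 80.0, 10.0, 10.0, 10.0]
DAMPING_LD = [20.0, 20.0, 20.0, 6.0, 6.0, 6.0]
DT = 0.001  # assumed control period in seconds (hypothetical)


def admittance_step(velocity, wrench, damping, mass=MASS, dt=DT):
    """One explicit-Euler step of M * dv/dt + D * v = F for diagonal M and D."""
    return [
        v + dt * (f - d * v) / m
        for v, f, d, m in zip(velocity, wrench, damping, mass)
    ]


def steady_state_velocity(wrench, damping, n_steps=5000):
    """Integrate a constant wrench from rest; the velocity settles near F / D."""
    velocity = [0.0] * 6
    for _ in range(n_steps):
        velocity = admittance_step(velocity, wrench, damping)
    return velocity
```

For the same constant force, the low-damping setting settles at four times the translational velocity of the high-damping setting (F/D), which reflects the trade-off between effort and stability discussed in this section.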
In addition to the damping levels, the task difficulty is varied by constraining the motion of hand to fine (Figs. 2(a) and 2(b)) and gross movements (Figs. 2(c) and 2(d)). A GM task allows freehand motion around two points in the workspace (denoted by bold circles). In contrast, a FM task requires the motion to be confined to the inner and outer boundaries (denoted by bold lines) separated by an average distance of 17.5 mm. As a result, GM and FM resemble a horizontal eight shape and a star shape, respectively (Fig. 2). Therefore, subjects perform four tasks as demonstrated in Fig. 2 by combining the task type (GM and FM) with the damping level (HD and LD), case 1: FM-HD, case 2: FM-LD, case 3: GM-HD, and case 4: GM-LD.
Note that the robot is more sensitive to the grasping pressure while performing fine movements in the LD setting  and, thus, causes movement oscillations as observed in the position data of case 2 in Fig. 2. The stability can be enhanced by increasing the damping to a level such that the controller can attenuate high-frequency force components and make the controller more stable. However, the HD setting demands more physical effort to manipulate the robot compared to the LD setting. Therefore, the controller should adjust the damping (HD or LD) to appropriate levels during physical human–robot interaction.
2.1 Human Subject Study.
Twenty-six subjects (age group 23–34 years, all right-handed) participated in the study (Fig. 3). The experiments were conducted after obtaining approval regarding the setup and procedure from the university's Institutional Review Board (IRB# 030-801361). All the participants were recruited from the University at Buffalo School of Engineering and Applied Sciences. Before the experiment, participants went through a trial run to familiarize themselves with the equipment and the experiment protocol. The total time for each task (a total of four tasks) was fixed to 3 min, during which the participants could traverse the shape multiple times. Each participant performed all four tasks in pseudo-randomized order with no external intervention during any of the tasks. All the subjects were asked to wear a Myo armband on the forearm of their dominant hand (Fig. 1) to record the sEMG data. The applied force and respective position data were time-synchronized with the sEMG, and all the data were recorded with a sampling rate of 150 Hz.
3.1 Class Labeling.
An assistive robot should select an appropriate damping level suitable for the task (FM or GM) to enable fluid interaction between humans and robots. For instance, Memar and Esfahani  showed that high damping is preferred while performing precise and stable movements whereas low damping is preferred for faster and less constrained movements. The robot would demand more effort and make the task difficult if an inappropriate damping level is selected during the interaction. Therefore, we can consider cases 1 and 4 as a single category during classification. This leaves the other two categories, cases 2 and 3, with undesired damping levels. For instance, the damping should be increased when case 2 is encountered and decreased when case 3 is encountered. Therefore, we categorize the four tasks into a three-class problem with cases 1 and 4 as a single category and cases 2 and 3 as two other categories. We use the recorded sEMG data to predict each category.
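The labeling rule above can be written as a small lookup. This is a sketch only: the class names `maintain`, `increase`, and `decrease` are hypothetical labels chosen here to describe the damping adjustment each class implies.

```python
# Task cases as defined in the experimental setup: (task type, damping level).
CASES = {
    1: ("FM", "HD"),
    2: ("FM", "LD"),
    3: ("GM", "HD"),
    4: ("GM", "LD"),
}

# Cases 1 and 4 already use a suitable damping level and share one class;
# case 2 calls for more damping and case 3 for less (names are hypothetical).
CASE_TO_CLASS = {1: "maintain", 2: "increase", 3: "decrease", 4: "maintain"}


def damping_action(task_type, damping_level):
    """Return the class label for a (task type, damping level) combination."""
    for case, combo in CASES.items():
        if combo == (task_type, damping_level):
            return CASE_TO_CLASS[case]
    raise ValueError(f"unknown combination: {task_type}-{damping_level}")
```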
The three-class classification problem is solved using two main approaches as demonstrated in Fig. 4: (1) support vector machine (SVM) classifier with Riemann features and (2) raw sEMG data with a convolutional neural network (CNN). The continuous time-series data are split into constant length windows of 1 s known as epochs. Each new 1-s epoch is obtained after sliding the window by 500 ms, which constitutes a 50% overlap between windows. Riemann features are extracted from each 1 s window, and the resultant features are used to train an SVM classifier. On the other hand, a sEMG epoch is directly used to train the CNN classifier instead of extracting features.
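The epoching step can be sketched as follows, assuming the 150 Hz sampling rate reported in Sec. 2.1. The function name `sliding_epochs` and its parameters are illustrative, and each element of `samples` is assumed to hold one time step (for example, a tuple of eight channel values).

```python
def sliding_epochs(samples, fs=150, epoch_s=1.0, overlap=0.5):
    """Split a sequence of time steps into fixed-length, overlapping epochs."""
    length = int(round(epoch_s * fs))            # samples per epoch
    step = int(round(length * (1.0 - overlap)))  # hop between epoch starts
    if step < 1:
        raise ValueError("overlap too large for the chosen epoch length")
    return [
        samples[start:start + length]
        for start in range(0, len(samples) - length + 1, step)
    ]
```

With these settings, a 3 min recording (27,000 samples) yields 359 epochs of 150 samples each.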
3.2 Feature Extraction
3.2.1 Riemannian Features.
Recently, Riemann geometry has drawn a lot of attention in multivariate time-series classification [30,31]. Riemannian features are based on the covariance matrix extracted from the selected channels of a fixed window of time-series data. Barachant et al.  have shown that for brain computer interface applications, the mean covariance matrices for each class separate well on a Riemann manifold (RM). They obtained good classification accuracy with a simple minimum distance to the mean classifier in the manifold space. A more comprehensive review of the Riemann geometry and its applications for time-series data can be found in Ref. . Spatial covariance matrices can be extracted from the sEMG data and projected onto the tangent space for classification using SVM or linear discriminant analysis [33,34]. Additionally, an RPA-based  transfer learning approach can be implemented on these features to transform the new oncoming data (target) and match its statistics with that of the source data. Manjunatha et al.  extracted covariance matrices from eight channels of the sEMG data and applied RPA to demonstrate better classification performance across new sessions.
A classifier developed on the Riemannian feature space works well only if the data statistics remain the same. However, the statistics of physiological data can vary significantly across new subjects, thereby decreasing the classifier's performance. To address this problem, the Riemannian features of the target dataset have to be transformed on the manifold to match the statistics of the source data. This is done using the transfer learning approach proposed by Rodrigues et al., which performs an affine transformation of the target data to align it with the source data and learns a new classifier on the transformed target data.
3.3 Convolution Neural Network-Based Classification.
The CNN architecture shown in Fig. 5 is inspired by Refs. [28,29]. The network consists of two linear transformations and three convolution operations with a log-softmax output and no fully connected layers. The input to the network is the raw sEMG epoch. The first two operations are linear transformations. The first convolution operation is across time and captures temporal information, whereas the second convolution operation is across the sEMG channels and captures spatial information. After the temporal and spatial convolutions, the output is squared, average pooled, and log-transformed. The temporal and spatial convolutions can be combined into a single three-dimensional (3D) convolution operation. However, splitting the 3D convolution into two 2D convolutions facilitates the study of time and spatial domain features. The architecture is inspired by the filter bank common spatial patterns (FBCSPs) method, which has been very effective in EEG/sEMG classification studies. The squaring and log transformation implemented in this network is similar to the trial log-variance computation in FBCSP. Figure 6 shows the trainable parameters and convolution operations of the network.
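The link between the square, average-pool, and log stage and the FBCSP log-variance can be checked numerically. The sketch below assumes a single zero-mean filtered channel and one pooling window covering the whole signal; the actual pooling size of the network is not stated in this section, and the function names are illustrative.

```python
import math
import statistics


def square_pool_log(signal):
    """Square each sample, average over the window, then take the logarithm."""
    mean_power = sum(x * x for x in signal) / len(signal)
    return math.log(mean_power)


def log_variance(signal):
    """Log of the population variance, as used for FBCSP-style trial features."""
    return math.log(statistics.pvariance(signal))
```

For a zero-mean signal the two functions return the same value, because the mean of the squared samples is exactly the population variance.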
To understand the network architecture, let us consider a 1-s epoch E from the raw sEMG signal of size n × d, where n is the number of electrodes and d is the number of data samples. In this study, each epoch's dimension is n = 8 and d = 150. For such an input, the first and second linear transformations are given in Eq. (1).
4 Transfer Learning Approach and Baselines
We consider an inductive transfer learning approach where we have access to source data S (a model trained on S will be denoted as SM) and a small subset of labeled target data (Tl). The transferred model is tested on unlabeled target data Tu. Here, Tl and Tu together form the target data, i.e., Tl ∪ Tu = T. We balance Tl with an equal number of samples from each category to avoid training bias, and similarly, we balance Tu to obtain unbiased testing accuracy. The balanced Tl comes from the first 15 s of cases 2 and 3, and 7.5 s each of cases 1 and 4 for the third class (see Sec. 3.1).
4.2 Transfer Learning Procedure.
This section provides the details of the data splitting scheme used to generate the source S and target T datasets for inter-subject transfer learning. We perform five-fold cross-validation to assess the robustness. Figure 7 shows the scheme used for training the source model and splitting the target data. There are two ways to select the base model for inter-subject transfer learning. In the first approach, 10 random subjects are selected to train the base network (source data S), which is then transferred to the remaining 16 subjects. The second approach considers individual performance (i.e., high performers/low performers) to train the base model and transfers it to the rest of the subjects. The shift-scale base CNN model is trained on S from 10 subjects, with 70% of the data for training, 15% for validation, and 15% for testing.
In inter-subject transfer learning, only the shift and scale layers of the source model (SM) are re-initialized and retrained using the labeled target data Tl whereas the weights of the convolution layers after the shifting and scaling layers are frozen (Fig. 8). Note that Tl is only from the first 15 s of the experiment (see Sec. 4.1). The rationale for choosing the shifting and scaling layers for transfer learning is as follows: the CNN architecture tends to learn feature representation in a hierarchical fashion where features extracted become progressively specific to a given task starting from first layers to final layers [37,38]. Thus, in the CNN architecture (Fig. 5), the first layers would learn generalizable shifting and scaling parameters across the subjects in source data S, and the last layers should learn task-specific features. Since the task is the same, but the subjects are different, we choose to retrain only the shift and scale layers for transfer learning. Such a transfer learning procedure results in learning subject-specific Wsh and Wsc. Another added advantage is that the shifting and scaling layers are linear. So, relearning is inexpensive and less time-consuming.
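The idea of retraining only the shift and scale parameters while freezing everything downstream can be illustrated with a deliberately small stand-in. In this sketch a frozen logistic classifier (`weights`, `bias`) plays the role of the frozen convolution layers, and per-channel `shift` and `scale` values play the role of Wsh and Wsc. The model, the learning rate, and all names are assumptions for illustration, not the network used in this study.

```python
import math


def forward(x, shift, scale, weights, bias):
    """Shift and scale each channel, then apply the frozen linear classifier."""
    z = bias + sum(w * (xi - b) * a for xi, b, a, w in zip(x, shift, scale, weights))
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(data, shift, scale, weights, bias):
    """Average binary cross-entropy over (x, y) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = forward(x, shift, scale, weights, bias)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def adapt_shift_scale(data, weights, bias, lr=0.01, n_steps=200):
    """Gradient descent on shift and scale only; weights and bias stay frozen."""
    n = len(weights)
    shift, scale = [0.0] * n, [1.0] * n  # re-initialized for the new subject
    for _ in range(n_steps):
        g_shift, g_scale = [0.0] * n, [0.0] * n
        for x, y in data:
            err = forward(x, shift, scale, weights, bias) - y  # dLoss/dz
            for j in range(n):
                g_scale[j] += err * weights[j] * (x[j] - shift[j])
                g_shift[j] -= err * weights[j] * scale[j]
        shift = [b - lr * g / len(data) for b, g in zip(shift, g_shift)]
        scale = [a - lr * g / len(data) for a, g in zip(scale, g_scale)]
    return shift, scale
```

Because only `shift` and `scale` are updated, the number of retrained parameters grows with the number of input channels rather than with the size of the frozen feature extractor.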
4.3 Baselines for Comparison.
Most transfer learning techniques are geared towards processing the data as images [5,27,39]. To use the existing transfer learning approaches on EMG data, an additional pre-processing step is required to convert the raw EMG signals to spectrograms and treat the spectrograms as images. In contrast, our approach eliminates the initial pre-processing step and utilizes raw EMG signals to train a CNN and then applies transfer learning by fine-tuning only a small set of parameters [28,29]. In this study, we compare the transfer learning results from the CNN to those of the RM features transferred to new subjects. The transfer learning approach for covariance matrices on the Riemann manifold is known as RPA. This approach requires a fully labeled source S and a partially labeled target dataset Tl and involves three major steps for transforming the covariance matrices: re-center, re-scale, and rotate. First, the source and target datasets are re-centered using their respective mean covariance matrices, then the target data are scaled to match the dispersion of the source data, and finally, the target data are rotated. Following these operations, one can match the distributions of the source and target datasets and thus transfer the previously trained model to new subjects. In our previous work, we have shown that RPA-based transfer learning yields a significant performance gain compared to re-calibration applied on classical time domain features, where the model has to be retrained completely for a new subject.
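The re-center and re-scale steps can be sketched with NumPy as follows. This is a minimal illustration under the affine-invariant Riemannian metric: the mean is computed with a simple fixed-point iteration, the rotation step is omitted, and the function names are assumptions rather than the interface of any particular library.

```python
import numpy as np


def _apply_to_eigvals(C, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * fun(vals)) @ vecs.T


def riemann_mean(covs, n_iter=50, tol=1e-10):
    """Geometric mean of SPD matrices by fixed-point iteration."""
    M = np.mean(covs, axis=0)  # arithmetic mean as the starting point
    for _ in range(n_iter):
        M_sqrt = _apply_to_eigvals(M, np.sqrt)
        M_isqrt = _apply_to_eigvals(M, lambda v: 1.0 / np.sqrt(v))
        # Average of the data in the tangent space at the current estimate.
        T = np.mean(
            [_apply_to_eigvals(M_isqrt @ C @ M_isqrt, np.log) for C in covs], axis=0
        )
        M = M_sqrt @ _apply_to_eigvals(T, np.exp) @ M_sqrt
        if np.linalg.norm(T) < tol:
            break
    return M


def recenter(covs):
    """Move a set of covariance matrices so that its geometric mean is the identity."""
    M_isqrt = _apply_to_eigvals(riemann_mean(covs), lambda v: 1.0 / np.sqrt(v))
    return np.array([M_isqrt @ C @ M_isqrt for C in covs])


def dispersion(centered):
    """Mean squared Riemannian distance of re-centered matrices to the identity."""
    return float(
        np.mean([np.linalg.norm(_apply_to_eigvals(C, np.log)) ** 2 for C in centered])
    )


def rescale(centered_target, d_source):
    """Raise re-centered target matrices to a power so their dispersion equals d_source."""
    s = np.sqrt(d_source / dispersion(centered_target))
    return np.array([_apply_to_eigvals(C, lambda v: v ** s) for C in centered_target])
```

After these two steps the source and target sets share the same mean (the identity) and the same dispersion, so only the rotation remains to be estimated from the labeled target samples.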
5 Results and Discussion
In this section, we initially compare the model performance for different epoch lengths and overlap percentages of the sEMG data. Then we select the best epoch length and overlap percentage to validate the inter-subject transfer learning performance of CNN and RM features.
5.1 Analysis of Hyper-Parameters.
Data preparation for the feature-based or raw sEMG (CNN) approach has two parameters: epoch length and overlap between sEMG epochs. To select the best set of parameters, we conducted a parametric study in which the epoch length is studied at 1 s and 2 s, and epoch overlap is set to 25%, 50%, and 75%. For the parametric study, we chose the pooled data of all the subjects and calculated five-fold cross-validation accuracies. The procedure is the same for both feature-based classifiers and CNN. The training time for CNN is ∼252.6 s and for SVM with Riemann features, the training time is ∼18.4 s. Note that the training time for the CNN model depends on many factors such as GPU memory, machine learning framework, and data-loading techniques.
Table 1 provides the classification accuracy of motor difficulty with different epoch lengths and 50% overlap between epochs. The classification accuracy decreases as the epoch length increases in both the feature-based and the CNN approaches, with the CNN approach performing consistently better. The decrease in classification accuracy might be because of the increase in noise within the epoch. Also, the number of data points available for training decreases.
| Method | 1-s epoch accuracy (%) | 2-s epoch accuracy (%) |
| SVM (RM) | 84.62 ± 0.98 | 83.59 ± 2.29 |
| CNN | 87.44 ± 0.28 | 84.14 ± 0.64 |
Table 2 provides the classification accuracy for different overlap percentages with a fixed 1-s epoch length. The classification accuracy increases as the epoch overlap increases in both the feature-based and the CNN approaches, with the CNN performing better than the feature-based approach. An overlap of 75% produces higher classification accuracy because more data points are available for learning, but at the cost of a longer training time. On the other hand, an overlap of 25% results in a shorter training time but lower accuracy. We chose an epoch length of 1 s with a 50% overlap to balance the training time and classification accuracy.
5.2 Convolution Neural Network Architecture Analysis.
This section provides an analysis of features at different stages of the CNN architecture (Fig. 5). Specifically, we have studied the feature after shift and scale layers to provide insights into the transfer learning procedure.
For an inter-subject transfer learning approach using the CNN, we chose to retrain the shift and scaling layers (Fig. 8). The hypothesis was that the shift and scaling layers act as normalizing layers for the new subject's data while the subsequent layers act as fixed feature extractors (Fig. 6). In other words, the shift and scale layers try to push Tl towards S (see Sec. 4.1 for the data splitting scheme). This is because the fixed feature weights (Wsp and Wc) are trained on S. To explore this further, we analyzed the features just after the shift and scale layers, i.e., E′ = (E − Wshe′) ☉ (Wscẽsh) for a test subject. An average cosine similarity is calculated between the features E′ of S (10 subjects’ data) and those of Tl. As shown in Table 3, the cosine similarity increases with the number of training steps, which suggests that the similarity between Tl and S is increasing. Furthermore, we also performed a t-distributed stochastic neighbor embedding (t-SNE) of the features E′ of S and of Tl (subject s2). As seen in Fig. 9, the overlap between the target data and source data increases over subsequent training steps. Thus, the shift and scale layers act as normalizing layers pushing Tl towards the distribution of S.
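The similarity measure can be sketched as below. The text does not specify how the average is formed, so this sketch assumes that each feature epoch is flattened into a vector and that the cosine similarity is averaged over every source and target pair; the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_cosine_similarity(source_feats, target_feats):
    """Mean cosine similarity over every (source, target) pair of feature vectors."""
    sims = [cosine_similarity(s, t) for s in source_feats for t in target_feats]
    return sum(sims) / len(sims)
```

A value that rises towards 1 over training steps indicates that the shifted and scaled target features are pointing in increasingly similar directions to the source features.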
5.3 Transfer Learning Results.
Figure 10 provides the inter-subject transfer learning results on 16 different subjects. Transfer learning based on the CNN performs better than the RPA baseline (see Sec. 4.3) for most subjects. Out of 16 subjects, CNN-based transfer learning performs better than RPA in 13 subjects, and the performance is not drastically different among the three subjects where RPA is better than the CNN. Only 10% of the data (the first 15 s of each task) was used for the transfer learning procedure. To statistically establish the performance gain of the CNN approach, we conducted a repeated-measures analysis of variance (ANOVA) test. For all the statistical tests, we used a significance level (α) of 0.05. The test revealed that the CNN performs statistically better than the RPA method (p-value 0.0002). The lower performance of RPA might be because the statistics of S and Tl are drastically different and the manifold-based transformations cannot capture this difference. Because the CNN approach outperforms the RPA approach, the rest of the analysis uses only the CNN approach.
We further explored whether the base model choice significantly affects the transfer learning performance. For instance, to study this hypothesis, one can select high-performing/low-performing subjects from the pool of 26 subjects to train a base model and then transfer to the rest of the subjects. To choose high and low performers from the pool of 26 participants, we have used a quantitative metric known as the instability index, IR (see Appendix A). This metric is computed from the interaction force data recorded using the force/torque sensor. The IR index has been previously shown to capture the interaction instability and reflects the subject's motor control ability [15,42]. For calculating the instability index, we have used the case 2 scenario (see Sec. 2) as it represents the most unstable interaction with the robot (Fig. 2(b)). Note that increasing instability index signifies lower motor control ability and lower instability index indicates higher motor control ability.
Out of 26 subjects, the six subjects with the highest instability index are labeled as low performers, and the five subjects with the lowest instability index as high performers. Two base models were trained, one using the high performers and one using the low performers, and each was transferred to the rest of the subjects. Again, we used a repeated-measures ANOVA (with significance level 0.05) to test whether the base model choice affects transfer learning performance. The test revealed no significant change in transfer learning performance (on the same target subjects) when the base model is switched from high performers to low performers.
Note that the inter-subject transfer learning results hold only when the labeled Tl is sufficient to learn the data distribution. To analyze this further, we repeat the transfer learning with different sizes of labeled target data (Tl) for new subjects. Table 4 shows the average classification accuracy across new subjects when the labeled target data (Tl) size is varied from 10% to 30% (of T) in steps of 5%. The performance of both CNN-based transfer learning and RPA increases as the data available for training increases. A probable explanation is that, with more training data, the classifiers can learn the data distribution better. Nevertheless, CNN-based transfer learning outperforms RPA-based transfer learning at each size. In addition, the structure of the CNN itself facilitates transfer learning. Such a transfer is not possible with other classical approaches, such as SVM or a random forest classifier (RFC). These algorithms (SVM/RFC) have to be retrained completely for a new subject, which involves more computational time. In contrast, the CNN transfer learning approach used in this study requires significantly fewer parameters to be retrained while the rest of the network parameters are frozen.
In this work, we classify the motor difficulty during physical human–robot interaction using sEMG. We designed a collaborative task where the human subject guides an admittance-controlled robot. The motor difficulty in guiding the robot is changed by varying the admittance controller's damping parameter from LD to HD and task type from GM to FM. Therefore, a total of four experiments were performed by each subject and the sEMG signals are recorded during the experiment using a Myo armband placed on the subject's forearm.
Based on the damping level and task type, the motor control difficulty is modeled as a three-class classification problem, where we used two approaches: an SVM classifier with Riemannian features and a convolution neural network with raw sEMG. The results demonstrated that CNN outperforms the SVM classifier with Riemannian features. However, both approaches perform poorly while classifying the new subject's data. This is mainly because of the individual differences among subjects.
To overcome this issue, we used a transfer learning technique in which we retrained part of a pre-trained CNN model by freezing the convolution weights and re-initializing only the shift and scale layers of the network. We used 10% of the new data (first 15 s of the experiment) for training, and the remaining 90% for testing. We demonstrated that the inter-subject classification accuracy significantly increased when we used transfer learning, and the CNN-based transfer learning outperformed the RPA-based transfer method. We also demonstrated that the skill level of the subjects considered for training the base model does not have a significant effect on the transfer to new users. This is particularly useful for adapting the robot control strategy to new users involved in a workforce training program.
In the future, we will include a real-time adaptation strategy for the robot based on motor difficulty detected using the three-class classification model's output. However, such direct use of classification should also explore the appropriate update rate of control strategy as more frequent updates might result in an unstable interaction.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.
Appendix A: Analysis of Instability Index
Here, ω0, ωc, and ωN denote the lowest, cutoff, and Nyquist frequencies, respectively. The cutoff frequency is selected as 3 Hz based on the maximum frequency of voluntary upper limb movements. P(ωi) denotes the PSD corresponding to a frequency, ωi, of the signal. The value of the IR index is in the range of (0, 1) as it is the ratio of the power above the cutoff frequency to the total power. A higher value of IR corresponds to more power in the high-frequency region and is thus an indication of higher instability. Therefore, the IR index can be used as a metric to recognize high and low performers in the experiment.
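The ratio described above can be sketched with a plain periodogram. The sketch assumes that the PSD is estimated from a direct discrete Fourier transform of a single force channel, that the lowest frequency is the 0 Hz bin, and that "above the cutoff" means strictly greater than 3 Hz. The function name and these choices are illustrative assumptions rather than the exact estimator used in the study.

```python
import cmath
import math


def instability_index(signal, fs, cutoff_hz=3.0):
    """Ratio of spectral power above the cutoff frequency to the total power."""
    n = len(signal)
    above, total = 0.0, 0.0
    for k in range(n // 2 + 1):  # bins from 0 Hz up to the Nyquist frequency
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        power = abs(coeff) ** 2
        total += power
        if k * fs / n > cutoff_hz:
            above += power
    return above / total
```

A signal made of two equal-amplitude sinusoids, one below and one above the cutoff, gives an index close to 0.5, while a purely low-frequency signal gives an index close to 0.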