
A Machine Learning Approach to Automate Facial Expressions from Physical Activity

Tarik Boukhalfi
Multimedia Lab, École de technologie supérieure
Montreal, Canada
[email protected]

Christian Desrosiers
Multimedia Lab, École de technologie supérieure
Montreal, Canada
[email protected]

Eric Paquette
Multimedia Lab, École de technologie supérieure
Montreal, Canada
[email protected]

ABSTRACT
In this paper, we propose a novel approach based on machine learning to simulate facial expressions related to physical activity. Because of the various factors they involve, such as psychological and biomechanical, facial expressions are complex to model. While facial performance capture provides the best results, it is costly and difficult to use for real-time interaction. A number of methods exist to automate facial animation related to speech or emotion, but there are no methods to automate facial expressions related to physical activity. This leads to unrealistic 3D characters, especially when performing intense physical activity. The purpose of this research is to highlight the link between physical activity and facial expression, and to propose a data-driven approach providing realistic facial expressions, while leaving creative control. First, biological, mechanical, and facial expression data are captured. This information is then used to train regression trees and support vector machine (SVM) models, which predict facial expressions of virtual characters from their 3D motion. The proposed approach can be used with real-time, pre-recorded, or key-framed animations, making it suitable for video games and movies as well.

Keywords
Facial Animation - Biomechanics - Physical Activity - Machine Learning.

1 INTRODUCTION
Whether in video games or in movies, 3D animation is becoming more and more realistic, particularly with the use of motion capture. However, facial animation remains one of the most challenging, time-consuming, and costly aspects of 3D animation. Facial expressions are difficult to model because of the numerous factors underlying them: emotions (joy, sadness, etc.), mouth movements (speech, deep breath, etc.), eye and eyelid movements (blinking, gaze direction, etc.), and physiological factors (fatigue, pain, etc.).

Different approaches for automating facial expressions related to emotion or speech exist, but none are available to automate expressions related to physical activity. In the visual effects and computer animation communities, facial animations are most often key-framed or motion-captured. Even though this is a relatively long and costly procedure, it is understandable for main characters. For secondary characters, such as a crowd, the facial animation related to physical activity will often be disregarded. In the video game industry, although characters often have to exert significant physical effort, facial animation related to this component is somewhat neglected. It is sometimes present during cinematic sequences, but it suffers from a crude approximation during gameplay, and it is often simply overlooked. According to discussions we have had with video game companies, current approaches for gameplay facial animations related to physical activity rely on ad hoc techniques based on linear functions and thresholds. Such approaches fall far short of the complexity of real human facial expressions during physical activity.

In this paper, we propose a novel approach based on machine learning to simulate facial expressions related to physical activity, in order to improve the realism of 3D characters. The approach is based on the analysis of motion capture data acquired from real exercise sessions. Given the captured animations and physiological data, specific machine learning techniques are selected to enable the synthesis of facial expressions corresponding to physical activity. The main contributions of the proposed approach can be summarized as:

• A machine learning framework to derive facial expressions from physical activity;

• An approach to link mechanical, physiological, and facial measurements;


• An analysis of the most effective way to compute energy values for machine learning purposes;

• A set of empirical rules relating physical activity to specific facial expressions;

• A normalization procedure to make better use of heart rate and blend shape data.

With this machine learning framework and captured data, we are able to synthesize realistic facial expressions. The approach can be used for real-time as well as off-line facial animation. Furthermore, the method allows for control over stylization, as key-framed data could be used instead of captured data. It enables control over expressiveness, as the animator can adjust various parameters that have an impact on the facial expression of the character. Finally, the models developed in this work also provide metabolic data that could be used for purposes other than facial animation.

2 RELATED WORK
Previous work is categorized into four topics: the description of facial expressions, the synthesis of speech-related expressions, the synthesis of emotion-related expressions, and the description of facial expressions related to physical activity.

2.1 Facial Expression Description
Several objective and systematic approaches to encode facial expressions have been proposed. Although facial expressions are due to a wide range of factors, only facial changes due to emotions, intentions, or social communication are taken into account [12]. Various coding systems have been developed mainly for psychological studies, including FACES (Facial Expression Coding System) [9] and FACS (Facial Action Coding System) [4], and are presented in a survey paper [13]. FACS is an anatomically-based expression space grouping together facial muscle groups as AUs (Action Units), whose combinations can be used to form any possible expression [20]. The MPEG-4 standard proposes a similar approach using FAPs (Facial Animation Parameters), which has been used in various research projects. In the proposed approach, the facial expressions are built using blend shapes that correspond to facial muscle groups, similar to the FACS approach.

2.2 Speech-Related Facial Expressions
Animation synthesis is generally done by analyzing an audio input, extracting phonemes, and then animating the 3D face model's visemes (the visual counterparts of phonemes) [10]. Different approaches have also been developed to enhance realism in animation, such as blending speech-driven animation into emotion-driven animation [11] and using anatomically-based structures [11, 21]. Other works have focused on improving the visual behavior related to speech [24]. Some works use machine learning methods such as SVMs [25] or neural networks [3, 15]. All of these methods help to achieve more realistic results in facial expressions related to speech, and substantially reduce manual animation time. They give good results, given that there is a single input (the audio) that captures all of the required information to adjust the facial animation.

When considering physical activities, a character's motion involves several limbs, as well as potential and kinetic energy, torques, etc., which results in a broader set of inputs. Furthermore, speech-related animation is synthesized from, and synchronized to, a single input signal (the audio), while the facial expression derived from the character's motion might be the result of both its instantaneous motion and the movements or activities performed by the character in the past few minutes. Finally, the character's motion triggers facial expressions that influence parts of the face that are not related to speech, such as the region around the eyes. For these reasons, works dealing with speech cannot fully solve the problem of generating the facial animation related to physical activity.

2.3 Emotion-Related Facial Expressions
Researchers in psychology have studied emotions and proposed classifications based on a limited number of basic emotions. To further simplify the relationships between emotions, they can be represented in simpler 2D expression spaces [18, 23]. Computer graphics researchers have taken advantage of such approaches and proposed different two-dimensional emotion layouts that allow a meaningful blending between emotions [19]. Other approaches have relied on coding systems such as FACS to provide realistic transitions between emotion expressions [1]. While these works provide interesting approaches for the transition and blending of facial expressions, they work when the emotion is already known, and when a set of face poses is provided. Creating the mesh deformations corresponding to a set of emotions was studied in the work of Puklavage et al. [16]. Animating the right combination of emotions through time remains a complex problem. It is similar to the challenge involved in this work: developing an approach that can predict the facial expression from observations and models describing how a human subject reacts in different circumstances.

2.4 Physical Activity-Related Facial Expressions
Even though 3D characters often perform intense physical activities, we could not find any research that addresses the automatic and realistic facial animation related to moderate and intense physical activity. Outside of the computer graphics literature, the work of McKenzie [14] describes the facial expressions related to substantial effort, exhaustion, and fatigue.


Figure 1: Machine learning facial expressions. Off-line process: motion capture data, facial capture data (strength and cardio), heart rate, EPOC, and mechanical work are used to train the models; real-time process: the 3D character motion drives the regression tree model and the SVM models to produce the facial expression.

In the computer graphics field, facial animation literature related to physical activity is found mostly in art books [5, 7]. These contain numerous facial expressions, some of which are related to physical activity. While these works provide useful information for manually animating facial expressions, they are not useful in the context of automatic facial animation from the character's motion. Other works [26, 2] target the modeling and control of 3D models in relation to laughter or respiration. Numerous works address facial animation in the context of speech and emotion, but they are not adapted to the synthesis of physical activity expressions. Based on machine learning and captured data, the proposed approach derives a model to animate facial expressions.

3 FACIAL EXPRESSION SYNTHESIS
Fig. 1 shows an overview of the proposed approach. The first step is to acquire real-life motion capture data, providing information on the facial expressions observed under various types of exercise. This data is used to train machine learning models, which are then used to generate realistic facial expressions.

3.1 Data Acquisition
Various capture sessions were conducted in order to gather the information required to develop a model that gives realistic results. During these sessions, information describing the type of activities and physical state, heart rate, and facial expressions was recorded.

A full-body motion capture was done in a large room where 15 participants of different ages (20 to 46) and training levels (0 to 7.5) exercised freely without training devices. The resolution of the motion capture system cameras did not permit facial capture along with the full-body capture. Furthermore, using training devices during this capture session would have led to marker occlusion. For these reasons, the data acquisition was split into two capture sessions: full-body and facial. The goal of the full-body session was to provide data to establish the relationship between the motion and physiological measures, such as the heart rate. The participants were asked to alternate between exercises of low and high levels of intensity, and to slow down, to make sure a large range of data was acquired. While participants were training, both full-body motion and heart rate data were recorded. The software used to record the heart rate provided an estimation of other metabolic indicators, such as metabolic energy consumption, breathing rate, and EPOC (Excess Post-exercise Oxygen Consumption).

The second capture session was done at a fitness center, where 17 participants from the same age groups and training levels as the previous session were asked to exercise on either a cardiovascular training machine or a mixture of strength training machines and free weights. The goal of this capture session was to establish the relationship between motion, heart rate, and facial expression. Again, this capture involved exercising at different levels of intensity and a slow-down period. Using this procedure, the data collected for each exercise included repetitions for the same participant as well as for different participants. Facial expressions were filmed, while heart rate data were recorded following the same procedures and with the same equipment as the first capture session. Together with the height and weight of the participants, the specific loads used with the strength training machines and free weights allowed for a good approximation of the involved work and forces.

3.2 Biomechanical Model
One of the key inputs to both the off-line and on-line phases of the proposed approach is the mechanical work resulting from the motion. Three energy terms were used to approximate this work: potential energy, translational kinetic energy, and rotational kinetic energy. Different ways of evaluating these mechanical energies were also tested: using the center of mass of the whole character, of the lower and upper body, or computing the values for each limb.

Tests were conducted to find an approach that would be efficient to compute, while providing good results for both the learning and synthesis phases. While separate inputs for each limb intuitively seemed to provide better knowledge about the type of exercise and effort, they resulted in noisy facial animations with blend shape weights that changed too rapidly compared to the real data. An explanation for this phenomenon is that even though there are several captures, the amount of input data is still too small to correctly capture the intricate interrelations between specific limbs and facial expressions. Ultimately, what provided the best result was using the sum of the mechanical work (potential, translational kinetic, and rotational kinetic) for all limbs.
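As an illustration, the following Python sketch computes such a per-frame aggregate under simple assumptions: per-limb mass, inertia, center-of-mass position and velocity, and angular velocity are available from the motion data, and the work of a limb over a frame is taken as the absolute change of its mechanical energy. The paper does not specify the exact aggregation, so this is one plausible choice, and all names are illustrative.

```python
import numpy as np

GRAVITY = 9.81  # m/s^2

def limb_energy(mass, inertia, com_height, com_velocity, angular_velocity):
    """Mechanical energy of one limb: potential + translational kinetic +
    rotational kinetic (inertia is a 3x3 tensor expressed in the same frame
    as angular_velocity, which is in rad/s)."""
    potential = mass * GRAVITY * com_height
    kinetic_t = 0.5 * mass * float(np.dot(com_velocity, com_velocity))
    kinetic_r = 0.5 * float(angular_velocity @ inertia @ angular_velocity)
    return potential + kinetic_t + kinetic_r

def frame_mechanical_work(limbs_prev, limbs_curr):
    """Per-frame mechanical work, approximated as the sum over all limbs of
    the absolute change in mechanical energy between two consecutive frames.
    Each element of limbs_prev/limbs_curr is the argument tuple of limb_energy."""
    work = 0.0
    for prev, curr in zip(limbs_prev, limbs_curr):
        work += abs(limb_energy(*curr) - limb_energy(*prev))
    return work
```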

3.3 Machine Learning Facial Expressions
To get a better understanding of the underlying mechanisms and relations between the exercises and the facial expressions, a preliminary analysis of the data was conducted. As the relations between the metabolic, mechanical, and facial parameters are too complex to model using simple polynomial equations, it made sense to use machine learning to predict the facial expressions. Given the type of captured data and the kind of predictions required, regression techniques were the most appropriate. To select the best approach, several models were trained using different sets of features as input, and the quality of these models was evaluated on a test data set. Likewise, appropriate model parameters were selected using cross-validation. The data flow, learning approaches, and models are presented in Fig. 1.

3.3.1 Metabolic Parameters Prediction
To predict the heart rate from the character's motion, various learning techniques were tested with different combinations of features as input. The heart rate increases or decreases depending on the intensity of the exercise: for each person, there is a certain threshold in exercise intensity that results in an increase or decrease in the demand for oxygen. To model this behavior, regression trees were found to give an accuracy comparable to more complex models such as SVMs. Furthermore, this technique was also selected for its ability to provide a human-interpretable model, which can be used to get more artistic control on the final result.

Since the range of input values affects learning techniques, and as the range of heart rates varies among the candidates, the data was transformed to the [0, 1] interval, resulting in the normalized heart rate (NHR):

NHR = (current heart rate − resting heart rate) / (maximum heart rate − resting heart rate)

The maximum and resting heart rates are found in standard training charts based on the age and training level.
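A minimal helper illustrating this normalization; the age-based maximum heart rate below is a common rule-of-thumb approximation standing in for the training charts mentioned above, not the values used in the paper.

```python
def normalized_heart_rate(current_hr, resting_hr, max_hr):
    """Map the measured heart rate to [0, 1] relative to the resting and maximum rates."""
    nhr = (current_hr - resting_hr) / (max_hr - resting_hr)
    return min(max(nhr, 0.0), 1.0)  # clamp measurement noise into the valid range

def approximate_max_heart_rate(age):
    """Stand-in for the standard training charts: a common age-based rule of thumb."""
    return 220.0 - age
```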

A regression tree model was built using the UserClassifier in the Weka software [8]. Several combinations of inputs were tested (mechanical work as input and heart rate as output, mechanical work and heart rate difference as input and heart rate difference as output, etc.). Among the tested models, the one providing the best results was to predict the difference in heart rate using the current heart rate and the instantaneous mechanical work. By using the last predicted NHR in the subsequent prediction, the model considers the temporal information and the accumulated fatigue.

While a model trained using the data from a single participant could accurately predict the heart rate of this participant (correlation coefficient of 0.88 and root-mean-square error, RMSE, of 19%; see Fig. 2(a, c)), combining the data from every participant in a single model resulted in significant errors (correlation coefficient of 0.21 and RMSE of 79%; see Fig. 2(b, d)). To improve the results, simpler models were derived.

Figure 2: Comparison between real (blue) and predicted (red) NHR as a function of time (in minutes) for (a-b) cardiovascular and (c-d) strength training exercises; (a, c) the same candidate and (b, d) a different candidate.

Using appropriate regularization parameters and tree pruning to limit the complexity of the regression yielded simple regression trees with only two leaves. By analyzing these simpler models for each participant, a common structure emerged: all regression trees had the same test at the root node, and the test compared the mechanical work w to a threshold value t_root, which broke down the predictions into increases or decreases in the heart rate. Moreover, the main difference between these personalized models was the threshold value used in the test; this value depended largely on the fitness level of the participant. Based on these observations, a linear regression between the training level and t_root was used to improve the model. The resulting model, based on the linear regression and regression tree, is as follows:

t_root = 7.13 + 0.42 × training level

Δ(nhr) = c_inc × (w − t_root) × (1 − nhr)²   if w > t_root
Δ(nhr) = c_dec × (w − t_root) × nhr²         if w ≤ t_root

c_inc = 0.0056 − 0.00043 × training level
c_dec = 0.0009 + 0.00025 × training level

The threshold t_root determines when the heart rate starts to increase, while c_inc and c_dec are the factors of increase or decrease. These values were obtained by calculating a linear regression between the individual values obtained in the regression tree of each participant.

Using this model provided a prediction that was almost as good as the one for separate candidates, with a correlation coefficient of 0.87 and an RMSE of 24%. Furthermore, the training level can be used by animators as a parameter to control the response level of characters to various types of exercises.
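The sketch below transcribes the equations above into a per-step update; feeding each predicted NHR back into the next call reproduces the accumulated-fatigue behavior described earlier. Function names and the clamping to [0, 1] are illustrative choices, not the paper's implementation.

```python
def heart_rate_model_params(training_level):
    """Coefficients of the simplified model, obtained in the paper by linear
    regression over the per-participant regression trees."""
    t_root = 7.13 + 0.42 * training_level
    c_inc = 0.0056 - 0.00043 * training_level
    c_dec = 0.0009 + 0.00025 * training_level
    return t_root, c_inc, c_dec

def update_nhr(nhr, work, training_level):
    """One prediction step: the change in normalized heart rate (NHR) depends on
    whether the instantaneous mechanical work exceeds the threshold t_root."""
    t_root, c_inc, c_dec = heart_rate_model_params(training_level)
    if work > t_root:
        delta = c_inc * (work - t_root) * (1.0 - nhr) ** 2
    else:
        delta = c_dec * (work - t_root) * nhr ** 2
    return min(max(nhr + delta, 0.0), 1.0)
```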


Figure 3: Comparison between real (blue) and predicted (red) mouth stretching blend shape weight (0 to 100) for (a) a cardiovascular exercise and (b), (c), and (d) strength training exercises. Curves in (a) and (b) are for the prediction of the same person for the same exercise. Curves in (c) are for the prediction of a different person, but the same exercise, and (d) is the prediction of the same person, but a different exercise.

The normalized heart rate is used to approximate the oxygen consumption (VO2) and respiration rate. Both the VO2 and respiration rate have to be normalized relative to their minimal and maximal values. Given the normalized values, the VO2 and respiration rate are proportional to the normalized heart rate [22]. With the VO2 estimation, EPOC can be approximated [6]. These estimations, together with the mechanical work and mechanical power, are then used in the prediction of facial animations, as described in the next section.

3.3.2 Predicting Expression Components
To animate a virtual character, the four weights corresponding to the blend shapes associated with the basic components identified in the preliminary analysis (see Section 4.1.1) should be obtained from the movement of the character. Compared to the metabolic parameters, the facial expressions in our capture data exhibited more sudden and frequent changes. Because of this behavior, regression trees did not provide adequate results to predict blend shape weights.

Instead, we opted for SVM regression, which provided better prediction results and had already been used successfully for facial animation [24]. Tests were conducted with multiple participants, for multiple exercises, as well as for single participants and single exercises (see Fig. 3). For a single participant, the prediction of the participant's blend shapes corresponding to exercises not used to train the model was accurate, as shown in Fig. 3(d). Compared to what was observed for the metabolic parameters, the prediction for strength training exercises (Fig. 3(b), (c), and (d)) lines up quite well with the real data, while the prediction for cardiovascular exercises (Fig. 3(a)) follows the general trend of the curve, but presents variations of smaller amplitude due to the regularization of the model.

For multiple participants, training with a single exercise enabled a relatively good prediction of the same exercise for a participant not used in the training data (see Fig. 3(c)). Nevertheless, generalization across all of the participants and exercises was relatively poor. Again, the data had to be normalized, but this time with respect to the expressiveness of the participant. This can be seen in Fig. 3(c), as the curves are well aligned, but the blend shape weights are on a different scale. Some of the participants could endure considerable exertion with a relatively neutral expression, while others depicted pronounced expressions. The expressiveness could not be linked to any of the parameters collected about the participants (age, training level, height, etc.). It can still be computed for each participant by finding the maximal weight for each blend shape, resulting in a four-dimensional expressiveness vector. The captured blend shape weights of the participants were then normalized. This expressiveness vector allows the animator to control the facial expressions by increasing or reducing the expressiveness values.
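A small sketch of this per-participant normalization and of how the expressiveness vector can be applied at synthesis time; the blend shape identifiers follow Fig. 7 but are illustrative names, not the paper's data format.

```python
import numpy as np

# Illustrative identifiers for the four blend shapes of Fig. 7.
BLEND_SHAPES = ["mouth_stretch", "mouth_opening", "eyes_squinting", "eyebrows"]

def expressiveness_vector(captured_weights):
    """Per-participant expressiveness: the maximal captured weight of each
    blend shape (captured_weights maps blend shape name -> array of weights)."""
    return np.array([captured_weights[name].max() for name in BLEND_SHAPES])

def normalize_weights(captured_weights, expressiveness):
    """Rescale a participant's captured weights so every blend shape spans [0, 1]."""
    return {name: captured_weights[name] / max(expressiveness[i], 1e-6)
            for i, name in enumerate(BLEND_SHAPES)}

def apply_expressiveness(predicted_weights, expressiveness):
    """At synthesis time, the animator-controlled expressiveness vector scales
    the normalized predictions back to blend shape weights."""
    return np.asarray(predicted_weights) * expressiveness
```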

Given these normalized blend shape weights, the whole dataset could be learned using an SVM. To select the best-suited set of inputs and model parameters, several combinations were evaluated on a test data set and using cross-validation. The combination that gave the best results was mechanical work, mechanical power, normalized heart rate, and EPOC as inputs, and blend shape weight as output. As the predicted heart rate is used as an input for the next prediction (see Fig. 1), the models consider the temporal information and do not only model the correlation at a single-frame level.

With respect to the selection of the SVM parameters, the radial basis function (RBF) kernel was selected, and several combinations of parameters were tested: the regularization parameter and the gamma of the RBF varied independently from 10⁻¹⁰ to 10¹⁰. The values of these parameters that produced the best results were different from one blend shape to the other. The regularization parameter ranged from 10³ to 10⁶, while the gamma of the RBF ranged from 10⁻⁵ to 10⁻².
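For illustration, a grid search over the reported best-performing ranges could be set up as below with scikit-learn's SVR (which wraps LibSVM); the paper used LibSVM directly and searched a much wider range, so this is only a sketch, with the feature layout following the input combination described above.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

def train_blend_shape_model(features, weights):
    """features: one row per frame with [mechanical work, mechanical power,
    normalized heart rate, EPOC]; weights: the normalized weight of one blend
    shape. One such model is trained for each of the four blend shapes."""
    param_grid = {
        "C": 10.0 ** np.arange(3, 7),        # regularization range reported above
        "gamma": 10.0 ** np.arange(-5, -1),  # RBF gamma range reported above
    }
    search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                          scoring="neg_mean_squared_error", cv=5)
    search.fit(features, weights)
    return search.best_estimator_
```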

Since different areas of the face reacted in different ways depending on the physical activity and the candidate, the predictions use four SVM models. Although these are independent, the predictions were consistent in all of our tests. The training RMSE of the models was in the [17%, 26%] range. The final models enabled a good generalization of the captured data, which indicates that our method could be used to generate realistic facial animations for other types of movements.


Figure 4: The physical effort resulted in a broad range of facial expressions: (a) mouth, exertion/EPOC; (b) mouth, breathing/EPOC; (c) eyes, exertion/EPOC; (d) eyes, fatigue; (e) eyebrows, exertion; (f) eyebrows, fatigue.

Figure 5: Results for five minutes of running at medium speed with different training levels: (a) 10, (b) 5, (c) 0.

4 RESULTS
In this section, the approach and its experimental evaluation are discussed in the context of the expected use of the models. The accompanying video presents results for different types of exercises. The results present believable expressions based on the physical activity of the character, and these expressions improve the realism of the 3D character (see Fig. 5).

A prototype was developed to test the approach in real time and visually validate the resulting predictions. The learned regression tree and SVM models were used to animate facial expressions of a 3D head model. Using the LibSVM implementation to perform the predictions showed that the approach can easily achieve real-time frame rates. On an average computer, 60 to 600 FPS were obtained depending on the render settings and mesh complexity. The prototype can be used either by triggering pre-recorded animation clips or by using a live input from a Kinect device. The user can specify the age, height, weight, fitness level, and expressiveness vector of the 3D character, and can also add weights lifted in each hand as well as on the back. The facial expressions change relative to the motion of the character and the specified parameters. The user can change the character parameters and get real-time feedback. Fig. 5 shows different results achieved with different training levels, and Fig. 6 shows different results obtained by changing the expressiveness vector.
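To make the data flow of Fig. 1 concrete, the sketch below composes the helpers from the earlier sketches into one per-frame prediction step. The character and state fields, and the estimate_epoc placeholder, are illustrative assumptions, not the prototype's actual API.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class CharacterParams:
    training_level: float = 5.0
    frame_dt: float = 1.0 / 60.0
    expressiveness: np.ndarray = field(default_factory=lambda: np.ones(4))

@dataclass
class MetabolicState:
    nhr: float = 0.0  # normalized heart rate, carried over between frames

def estimate_epoc(nhr):
    """Placeholder for the EPOC approximation of Section 3.3.1 (via VO2);
    any monotone surrogate works for this sketch."""
    return nhr

def predict_frame(limbs_prev, limbs_curr, state, svm_models, character):
    """One step of the real-time process: character motion -> mechanical work ->
    metabolic state -> four blend shape weights scaled by the expressiveness
    vector. Reuses frame_mechanical_work, update_nhr, and apply_expressiveness
    from the earlier sketches."""
    work = frame_mechanical_work(limbs_prev, limbs_curr)
    power = work / character.frame_dt
    state.nhr = update_nhr(state.nhr, work, character.training_level)
    features = [[work, power, state.nhr, estimate_epoc(state.nhr)]]
    weights = np.array([model.predict(features)[0] for model in svm_models])
    return apply_expressiveness(weights, character.expressiveness)
```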

4.1 Discussion
4.1.1 Observations on the Captured Data
A qualitative analysis of the captured data (full-body capture, video, and heart rate) was conducted. It led to several key observations capturing the complex relations between physical activity and facial expressions.

While these observations helped us in the selection of the appropriate machine learning methods, they should also benefit artists in animating 3D characters. Two types of relations were discovered: general relations that hold for most facial expressions, and specific relations that apply to particular expressions.

The first general rule linking the physical activity to the facial expression is that the intensity of expressions is proportional to the displaced mass and inversely proportional to the mass of the muscles involved. The expression related to the instantaneous exertion is proportional to the mechanical power. The evolution of the expression intensity is proportional to the change in heart rate and metabolic energy. Finally, the recovery time is proportional to the effort intensity and inversely proportional to the training level.

Rules specific to individual components of facial expression were also observed. It was determined that the facial expression features associated with physical activity were concentrated around the eyes and the mouth (see Fig. 4). Regarding the expressions related to the mouth, the stretching is induced by two factors: instantaneous physical exertion and fatigue level. The mouth remains closed at the beginning of the training session. After a certain time, it starts to open, and the opening is linked to the respiration rate and the fatigue level.

Other observations were related to the region of the eyes. Eye squinting is mainly induced by instantaneous physical exertion until a certain fatigue level. At a higher level of fatigue, the eyes tend to relax in proportion to the fatigue level, while a constant residual squinting remains. The behavior of the eyebrows is a combination of a downward movement related to the physical exertion and an upward movement related to fatigue.

Finally, some observations were made with respect to both breathing and swallowing. The frequency of occasional breathing movements related to loud and quick expiration is induced by two factors: fatigue level and respiration rate. The frequency of the occasional gulping is proportional to the fatigue and to the regular respiration rate. These observations helped in defining a blend shape selection greatly inspired by the muscular groups of the human face, as described in FACS [4]. The model consisted of a neutral face and four blend shapes that can be used in various expressions linked to physical activity (see Fig. 7).

Figure 6: Results obtained with different expressiveness vectors: (a-b) softened expressions, (c) predicted expression, and (d-e) exaggerated expressions. The red lines are added as guides to help in comparing the expressions.

Figure 7: The four blend shapes used in our implementation: (a) neutral, (b) mouth stretch, (c) mouth opening, (d) eyes squinting, (e) eyebrows.

4.1.2 Manual Blend Shapes Recovery

As explained earlier, to simplify the capture sessions, only a video recording of the face was used for the facial expression capture. Different techniques were evaluated for retrieving facial animation data from the recorded video, such as video tracking software and facial tracking software, but it was found that manual animation provided more reliable results. To recover objective values that can be used with machine learning approaches, a virtual character's face was key-frame animated to match the expression of the participants.

To ensure the results are reproducible, blend shapes were key-framed, one at a time, and always in the same order. Furthermore, to measure how perceptually meaningful the values were, three different people independently adjusted the blend shape weights for a selection of eighteen representative poses. Even though the blend shape weights were not identical, the error remained limited to 11% on average and was considered to be quite sufficient for the purposes of this work.

4.1.3 Limitations

As shown in Fig. 7, the blend shape model used in this work is sufficient, but it does not cover the whole range of expressions that can be observed. Since each blend shape is predicted separately, there could be inconsistencies in the face of the character. Resolving such inconsistencies and providing a better correspondence among the muscular groups could be achieved through a constrained weight propagation [17]. While the mesh deformation used in this paper is based on blend shapes, the models could be learned with the use of other control mechanisms, such as bone systems.

The generated facial expressions are generalizations of the observed data. As such, they correspond to mean values and sometimes lack expressiveness compared to motion capture (see Fig. 8). The models sometimes output results that deviate significantly from the observations. As these deviations happen quite infrequently, and as the subsequent predictions are in accordance with the captured data, they can be easily filtered out.

The metabolic prediction model uses its last prediction as input. It is thus subject to error accumulation and could diverge from the observed values over time. This should become noticeable for long animations. Approaches to steer the values back to the observed range should be used to solve such problems.

Figure 8: Comparison between real and predicted facial expression from one person to another: (a) generated, (b-c) different candidates doing the same exercise.

Using SVM regression proved to be an effective way of predicting facial expressions. Nevertheless, it remains a "black box" technique that does not provide many options for customization.


5 CONCLUSION
Creating realistic facial animation based on physical activity is a challenging task. By analyzing two sets of captured data, this paper reveals several important observations about what triggers specific facial expressions. Based on the captured data, a combination of two machine learning techniques was used in order to automatically synthesize some metabolic parameters as well as the facial animation of a 3D character. While being automatic, this approach provides meaningful parameters that animators can change to deliver realistic and compelling facial animations that automatically adjust to the motion of the character. Furthermore, the metabolic parameters provided by the approach could also be helpful in animating other aspects of the character, such as breathing and sweating. Finally, the approach can be used for real-time applications as well as off-line high-quality rendering. The approach provides more realistic characters while reducing the burden of capturing or hand-animating the facial expressions resulting from physical activity.

Some limitations were noticed throughout the development of the proposed approach. Like other methods described in Section 2, the proposed approach addresses a single aspect of facial animation (only from physical activity). Future work would be to provide a framework that allows mixing different types of expressions.

While the proposed approach does provide realistic results, it is deterministic in nature: given the same control parameters and motions, it will result in the same facial animation. An interesting avenue for future research would be to incorporate the probabilistic and stochastic nature of human reactions into the models.

6 ACKNOWLEDGMENTS
This work was funded by the GRAND NCE.

7 REFERENCES
[1] A. Arya, S. DiPaola, and A. Parush. Perceptually valid facial expressions for character-based applications. International Journal of Computer Games Technology.

[2] P.C. DiLorenzo, V.B. Zordan, and B.L. Sanders. Laughing out loud: control for modeling anatomically inspired laughter using audio. In ACM SIGGRAPH Asia 2008 Papers, pages 125:1–125:8, 2008.

[3] P. Eisert, S. Chaudhuri, and B. Girod. Speech driven synthesis of talking head sequences. In 3D Image Analysis and Synthesis, pages 51–56, 1997.

[4] P. Ekman and W.V. Friesen. Facial action coding system: investigator's guide. Consulting Psychologists Press, 1978.

[5] G. Faigin. The Artist's Complete Guide to Facial Expression. Watson-Guptill Publications, 1990.

[6] Firstbeat Technologies. Indirect EPOC prediction method based on heart rate measurement. White paper.

[7] E. Goldfinger. Human Anatomy for Artists: The Elements of Form. Oxford University Press, 1991.

[8] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11:10–18, November 2009.

[9] A.M. Kring and D.M. Sloan. The Facial Expression Coding System (FACES): development, validation, and utility. Psychological Assessment, 19(2):210–224, June 2007.

[10] S. Kshirsagar and N. Magnenat-Thalmann. Visyllable based speech animation. Computer Graphics Forum, 22(3):631–639, 2003.

[11] S. Kshirsagar, T. Molet, and N. Magnenat-Thalmann. Principal components of expressive speech animation. In Proc. of Computer Graphics International 2001, pages 38–44, 2001.

[12] S.Z. Li, A.K. Jain, Y.-L. Tian, T. Kanade, and J.F. Cohn. Facial expression analysis. In Handbook of Face Recognition, pages 247–275. Springer New York, 2005.

[13] I.B. Mauss and M.D. Robinson. Measures of emotion: a review. Cognition & Emotion, 23(2):209–237, 2009.

[14] R.T. McKenzie. The facial expression of violent effort, breathlessness, and fatigue. Journal of Anatomy and Physiology, 40:51–56, October 1905.

[15] P. Hong, Z. Wen, and T.S. Huang. Real-time speech-driven face animation with expressions using neural networks. IEEE Transactions on Neural Networks, 13(4):916–927, July 2002.

[16] C. Puklavage, A. Pirela, A.J. Gonzalez, and M. Georgiopoulos. Imitating personalized expressions through an avatar using machine learning. In Proc. of the Intl. Florida Artificial Intelligence Research Society Conference (FLAIRS), pages 68–73. Association for the Advancement of Artificial Intelligence, May 2010.

[17] Q. Li and Z. Deng. Orthogonal-blendshape-based editing system for facial motion capture data. IEEE Computer Graphics and Applications, 28(6):76–82, 2008.

[18] J.A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178, 1980.

[19] Z. Ruttkay, H. Noot, and P. ten Hagen. Emotion disc and emotion squares: tools to explore the facial expression space. Computer Graphics Forum, 22(1):49–53, 2003.

[20] M. Sagar. Facial performance capture and expressive translation for King Kong. In ACM SIGGRAPH 2006 Sketches, 2006.

[21] E. Sifakis, A. Selle, A. Robinson-Mosher, and R. Fedkiw. Simulating speech with a physics-based facial muscle model. In Proc. of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 261–270, 2006.

[22] D.P. Swain and B.C. Leutholtz. Heart rate reserve is equivalent to %VO2 reserve, not to %VO2max. Med Sci Sports Exerc, 29:410–4, March 1997.

[23] R.E. Thayer. The biopsychology of mood and arousal. Oxford University Press, 1989.

[24] Y. Cao, W.C. Tien, P. Faloutsos, and F. Pighin. Expressive speech-driven facial animation. ACM Trans. Graph., 24:1283–1302, October 2005.

[25] Y. Li, F. Yu, Y.Q. Xu, E. Chang, and H.Y. Shum. Speech-driven cartoon animation with emotions. In Proceedings of the ACM International Conference on Multimedia, MULTIMEDIA '01, pages 365–371, 2001.

[26] V.B. Zordan, B. Celly, B. Chiu, and P.C. DiLorenzo. Breathe easy: model and control of human respiration for computer animation. Graph. Models, 68(2):113–132, March 2006.