
18th ACM International Conference on Multimodal Interaction (ICMI 2016), November 12–16, 2016, Tokyo, Japan

ICMI 2016 – Proceedings


EmotiW Challenge
Sat, Nov 12, 09:00 - 17:30, Time24: Room 183

EmotiW 2016: Video and Group-Level Emotion Recognition Challenges
Abhinav Dhall, Roland Goecke, Jyoti Joshi, Jesse Hoey, and Tom Gedeon
(University of Waterloo, Canada; University of Canberra, Australia; Australian National University, Australia)
This paper discusses the baseline for the Emotion Recognition in the Wild (EmotiW) 2016 challenge. Continuing the theme of automatic affect recognition `in the wild', the EmotiW 2016 challenge consists of two sub-challenges: an audio-video based emotion recognition sub-challenge and a new group-based emotion recognition sub-challenge. The audio-video based sub-challenge is based on the Acted Facial Expressions in the Wild (AFEW) database. The group-based emotion recognition sub-challenge is based on the Happy People Images (HAPPEI) database. We describe the data, baseline method, challenge protocols and the challenge results. A total of 22 and 7 teams participated in the audio-video based emotion and group-based emotion sub-challenges, respectively.

Emotion Recognition in the Wild from Videos using Images
Sarah Adel Bargal, Emad Barsoum, Cristian Canton Ferrer, and Cha Zhang
(Boston University, USA; Microsoft Research, USA)
This paper presents the implementation details of the proposed solution to the Emotion Recognition in the Wild 2016 Challenge, in the category of video-based emotion recognition. The proposed approach takes as input the video stream from the audio-video trimmed clips provided by the challenge and produces the emotion label corresponding to the video sequence. This output is encoded as one of seven classes: the six basic emotions (Anger, Disgust, Fear, Happiness, Sadness, Surprise) and Neutral. Overall, the system consists of several pipelined modules: face detection, image pre-processing, deep feature extraction, feature encoding and, finally, SVM classification.
This system achieves 59.42% validation accuracy, surpassing the competition baseline of 38.81%. On the test data, it achieves a 56.66% recognition rate, also improving on the competition baseline of 40.47%.
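To make the final encoding and classification stage concrete, here is a minimal sketch that average-pools per-frame deep features into a clip-level descriptor and trains a linear SVM. It is illustrative only: the feature dimensionality, the pooling choice and the data are placeholders, not details taken from the paper.

```python
# Minimal sketch of a clip-level encoding + SVM stage (placeholder data, not the
# authors' implementation): per-frame deep features are average-pooled into one
# descriptor per clip and fed to a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
NUM_CLASSES, FEAT_DIM = 7, 1024          # assumed feature dimensionality

def encode_clip(frame_features):
    """Average-pool per-frame CNN features into a single clip descriptor."""
    return frame_features.mean(axis=0)

# Placeholder "clips": per-frame feature matrices of shape (n_frames, FEAT_DIM).
train_clips = [rng.normal(size=(rng.integers(20, 60), FEAT_DIM)) for _ in range(100)]
train_labels = rng.integers(0, NUM_CLASSES, size=100)

X_train = np.stack([encode_clip(c) for c in train_clips])
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(X_train, train_labels)

test_clip = rng.normal(size=(30, FEAT_DIM))
print("predicted emotion id:", clf.predict(encode_clip(test_clip)[None, :])[0])
```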

A Deep Look into Group Happiness Prediction from Images
Aleksandra Cerekovic
We propose a Deep Neural Network (DNN) architecture to predict happiness displayed by a group of people in images. Our approach exploits both image context and individual facial information extracted from an image. The latter is explicitly modeled by Long Short-Term Memory networks (LSTMs) encoding face happiness intensity and the spatial distribution of faces forming a group. LSTM outputs are combined with image context descriptors to obtain the final group happiness score. We thoroughly evaluate our approach on the recently proposed HAPPEI dataset for group happiness prediction. Our results show that the proposed architecture outperforms a baseline CNN trained to predict group happiness directly from an image. Our method also shows an improvement of approximately 30% over the HAPPEI dataset’s baseline on the validation set.
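A minimal sketch of the face-stream idea described above, assuming per-face inputs of happiness intensity plus normalized position/size and a precomputed image-context descriptor; the architecture, dimensions and names are illustrative assumptions, not the paper's network.

```python
# Illustrative sketch (not the paper's exact architecture): an LSTM encodes a
# sequence of per-face descriptors (happiness intensity plus normalized face
# position/size); its last hidden state is concatenated with a global
# image-context descriptor and regressed to a group happiness score.
import torch
import torch.nn as nn

class GroupHappinessRegressor(nn.Module):
    def __init__(self, face_dim=4, context_dim=128, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(face_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + context_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, face_seq, context):
        # face_seq: (batch, n_faces, face_dim), context: (batch, context_dim)
        _, (h_n, _) = self.lstm(face_seq)
        fused = torch.cat([h_n[-1], context], dim=1)
        return self.head(fused).squeeze(1)    # group happiness score per image

model = GroupHappinessRegressor()
faces = torch.randn(2, 5, 4)      # 2 images, 5 faces each: [intensity, x, y, size]
context = torch.randn(2, 128)     # placeholder image-context descriptor
print(model(faces, context))      # two scalar happiness scores
```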

Video-Based Emotion Recognition using CNN-RNN and C3D Hybrid Networks
Yin Fan, Xiangju Lu, Dian Li, and Yuanliu Liu
(iQiyi, China)
In this paper, we present a video-based emotion recognition system submitted to the EmotiW 2016 Challenge. The core module of this system is a hybrid network that combines a recurrent neural network (RNN) and 3D convolutional networks (C3D) in a late-fusion fashion. RNN and C3D encode appearance and motion information in different ways. Specifically, the RNN takes appearance features extracted by a convolutional neural network (CNN) over individual video frames as input and encodes motion afterwards, while C3D models the appearance and motion of the video simultaneously. Combined with an audio module, our system achieved a recognition accuracy of 59.02% without using any additional emotion-labeled video clips in the training set, compared to the 53.8% achieved by the winner of EmotiW 2015. Extensive experiments show that combining RNN and C3D together can improve video-based emotion recognition noticeably.
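The late-fusion step can be illustrated with a small sketch: class-probability outputs of the two streams are combined with a weight chosen on a validation split. The scores below are random placeholders; only the general fusion recipe is shown, not the paper's exact settings.

```python
# Sketch of score-level (late) fusion between two video streams, e.g. a CNN-RNN
# stream and a C3D stream: the fusion weight is picked on a validation split.
import numpy as np

rng = np.random.default_rng(1)
n_val, n_classes = 200, 7
labels = rng.integers(0, n_classes, size=n_val)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Placeholder per-clip class scores from the two streams.
scores_rnn = softmax(rng.normal(size=(n_val, n_classes)))
scores_c3d = softmax(rng.normal(size=(n_val, n_classes)))

best_w, best_acc = 0.0, 0.0
for w in np.linspace(0.0, 1.0, 21):
    fused = w * scores_rnn + (1.0 - w) * scores_c3d
    acc = (fused.argmax(axis=1) == labels).mean()
    if acc > best_acc:
        best_w, best_acc = w, acc

print(f"best fusion weight {best_w:.2f}, validation accuracy {best_acc:.3f}")
```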

LSTM for Dynamic Emotion and Group Emotion Recognition in the Wild
Bo Sun, Qinglan Wei, Liandong Li, Qihua Xu, Jun He, and Lejun Yu
(Beijing Normal University, China)
In this paper, we describe our work in the fourth Emotion Recognition in the Wild (EmotiW 2016) Challenge. For the video-based emotion recognition sub-challenge, we extract acoustic, LBP-TOP, Dense SIFT and CNN-LSTM features to recognize the emotions of film characters. For the group-level emotion recognition sub-challenge, we use an LSTM and a GEM model. We train linear SVM classifiers for these kinds of features on the AFEW 6.0 and HAPPEI datasets, and use a fusion network we propose to combine all the extracted features at the decision level. Our final results are 51.54% accuracy on the AFEW test set and an RMSE of 0.836 on the HAPPEI test set.
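A sketch of decision-level fusion over several feature views is given below, with logistic regression standing in for the proposed fusion network; the feature names, dimensions and data are placeholders, and in practice the fusion model would be fitted on a held-out split rather than on training scores.

```python
# Sketch of decision-level fusion: one linear SVM per feature view, then a simple
# learned fusion model (logistic regression, standing in for the paper's fusion
# network) over the stacked per-view decision scores.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_train, n_classes = 300, 7
views = {"acoustic": 100, "lbp_top": 177, "dense_sift": 128}   # assumed dims
y = rng.integers(0, n_classes, size=n_train)
X = {name: rng.normal(size=(n_train, dim)) for name, dim in views.items()}

# Per-view SVMs and their class decision scores.
svms = {name: LinearSVC().fit(X[name], y) for name in views}
stacked = np.hstack([svms[name].decision_function(X[name]) for name in views])

# Fusion model over the concatenated decision scores.
fusion = LogisticRegression(max_iter=1000).fit(stacked, y)
print("fused training accuracy:", fusion.score(stacked, y))
```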

Multi-clue Fusion for Emotion Recognition in the Wild
Jingwei Yan, Wenming Zheng, Zhen Cui, Chuangao Tang, Tong Zhang, Yuan Zong, and Ning Sun
(Southeast University, China; Nanjing University of Posts and Telecommunications, China)
In the past three years, the Emotion Recognition in the Wild (EmotiW) Grand Challenge has drawn increasing attention due to its wide range of potential applications. For the fourth challenge, which targets video-based emotion recognition, we propose a multi-clue emotion fusion (MCEF) framework that models human emotion from three mutually complementary sources: facial appearance texture, facial action, and audio. To extract high-level emotion features from sequential face images, we employ a CNN-RNN architecture, where the face image from each frame is first fed into the fine-tuned VGG-Face network to extract face features, and the features of all frames are then traversed by a bidirectional RNN to capture the dynamic changes of facial texture. To attain more accurate facial actions, a facial landmark trajectory model is proposed to explicitly learn the emotion variations of facial components. Further, audio signals are also modeled in a CNN framework by extracting low-level energy features from segmented audio clips and stacking them into an image-like map. Finally, we fuse the results generated from the three clues to boost the performance of emotion recognition. Our proposed MCEF achieves an overall accuracy of 56.66%, a large improvement of 16.19% over the baseline.
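The audio branch can be illustrated with a small sketch: segment-level low-level features are stacked into a 2D, image-like map and classified by a compact CNN. The architecture and map size below are assumptions for illustration, not the network used in the paper.

```python
# Illustrative sketch of treating audio as an image: low-level energy features
# from consecutive audio segments are stacked into a 2D map and classified by a
# small CNN (architecture and shapes are assumptions, not the paper's network).
import torch
import torch.nn as nn

class AudioMapCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):                 # x: (batch, 1, 64, 64) energy map
        h = self.features(x)
        return self.classifier(h.flatten(1))

# Placeholder map: 64 audio segments x 64 energy-related features per segment.
audio_map = torch.randn(4, 1, 64, 64)
logits = AudioMapCNN()(audio_map)
print(logits.shape)                       # (4, 7) emotion logits
```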

Multi-view Common Space Learning for Emotion Recognition in the Wild
Jianlong Wu, Zhouchen Lin, and Hongbin Zha
(Peking University, China; Shanghai Jiao Tong University, China)
Recognizing emotion in the wild is a very challenging task. Recently, combining information from various views or modalities has attracted increasing attention. Cross-modality features and features extracted by different methods are regarded as multi-view information about a sample. In this paper, we propose a method to analyse multi-view features of emotion samples and automatically recognize the expression as part of the fourth Emotion Recognition in the Wild Challenge (EmotiW 2016). In our method, we first extract multi-view features such as BoF, CNN, LBP-TOP and audio features for each expression sample. Then we learn the corresponding projection matrices to map the multi-view features into a common subspace. At the same time, we impose l_{2,1}-norm penalties on the projection matrices for feature selection. We apply both this method and PLSR to emotion recognition. We conduct experiments on both the AFEW and HAPPEI datasets and achieve superior performance. The best recognition accuracy of our method is 0.5531 on the AFEW dataset for video-based emotion recognition in the wild. The minimum RMSE for group happiness intensity recognition is 0.9525 on the HAPPEI dataset. Both results are much better than the challenge baselines.
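A generic form of the common-space objective described above is sketched below, with per-view feature matrices X_v, projection matrices W_v, a shared target representation Y, and an l_{2,1} penalty for feature selection; the paper's exact formulation and notation may differ.

```latex
% Generic multi-view common-space objective of the kind described above
% (the paper's exact formulation may differ): each view X_v is projected by W_v
% toward a shared representation Y, with an l_{2,1} penalty for feature selection.
\begin{equation}
\min_{\{W_v\}} \; \sum_{v=1}^{V} \left\| X_v W_v - Y \right\|_F^2
  \;+\; \lambda \sum_{v=1}^{V} \left\| W_v \right\|_{2,1},
\qquad
\left\| W \right\|_{2,1} = \sum_{i} \sqrt{\textstyle\sum_{j} W_{ij}^2}.
\end{equation}
```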

HoloNet: Towards Robust Emotion Recognition in the Wild
Anbang Yao, Dongqi Cai, Ping Hu, Shandong Wang, Liang Sha, and Yurong Chen
(Intel Labs, China; Beihang University, China)
In this paper, we present HoloNet, a carefully designed Convolutional Neural Network (CNN) architecture underlying our submissions to the video-based sub-challenge of the Emotion Recognition in the Wild (EmotiW) 2016 challenge. In contrast to previous related methods, which usually adopt relatively simple and shallow neural network architectures for the emotion recognition task, our HoloNet incorporates three critical considerations in its network design. (1) To reduce redundant filters and enhance the non-saturated non-linearity in the lower convolutional layers, we use a modified Concatenated Rectified Linear Unit (CReLU) instead of ReLU. (2) To gain accuracy from considerably increased network depth while maintaining efficiency, we combine a residual structure with CReLU to construct the middle layers. (3) To broaden the network width and introduce multi-scale feature extraction, the top layers are designed as a variant of the inception-residual structure. The main benefit of grouping these modules into HoloNet is that both the negative and positive phase information implicitly contained in the input data can flow through the network along multiple paths, so deep multi-scale features explicitly capturing emotion variation can be extracted from multi-path sibling layers and then concatenated for robust recognition. We obtain competitive results in this year’s video-based emotion recognition sub-challenge using an ensemble of two HoloNet models trained with the given data only. Specifically, we obtain a mean recognition rate of 57.84%, outperforming the baseline accuracy by an absolute margin of 17.37% and yielding a 4.04% absolute accuracy gain over the result of last year’s winning team. Meanwhile, our method runs at a speed of several thousand frames per second on a GPU and is therefore well suited to real-time scenarios.
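The CReLU idea is easy to state in code: the ReLU of the input and of its negation are concatenated along the channel axis, so both phases are preserved. The sketch below shows only this basic form; the modified CReLU used in HoloNet may include additional scaling or shifting not shown here.

```python
# Minimal CReLU block: concatenates ReLU(x) and ReLU(-x) along the channel axis,
# so both positive- and negative-phase responses are kept. This is the basic
# form only, not HoloNet's exact modified CReLU.
import torch
import torch.nn as nn

class CReLU(nn.Module):
    def forward(self, x):
        return torch.cat([torch.relu(x), torch.relu(-x)], dim=1)

x = torch.randn(2, 8, 32, 32)        # (batch, channels, H, W)
print(CReLU()(x).shape)              # channels double: (2, 16, 32, 32)
```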

Group Happiness Assessment using Geometric Features and Dataset Balancing
Vassilios Vonikakis, Yasin Yazici, Viet Dung Nguyen, and Stefan Winkler
(Advanced Digital Sciences Center at University of Illinois, Singapore; Nanyang Technological University, Singapore)
This paper presents the techniques employed in our team's submissions to the 2016 Emotion Recognition in the Wild contest, for the sub-challenge of group-level emotion recognition. The objective of this sub-challenge is to estimate the happiness intensity of groups of people in consumer photos. We follow a predominantly bottom-up approach, in which the individual happiness level of each face is estimated separately. The proposed technique is based on geometric features derived from 49 facial points. These features are used to train a model on a subset of the HAPPEI dataset, balanced across expression and head pose, using Partial Least Squares regression. The trained model exhibits competitive performance for a range of non-frontal poses, while at the same time offering a semantic interpretation of the facial distances that may contribute positively or negatively to group-level happiness. Various techniques are explored for combining these estimates into a group-level prediction, including the distribution of expressions, the significance of a face relative to the whole group, and mean estimation. Our best submission achieves an RMSE of 0.8316 on the competition test set, which compares favorably to the baseline RMSE of 1.30.
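A minimal sketch of the geometric-feature and PLS regression stage: pairwise distances between the 49 facial points serve as the feature vector, and Partial Least Squares regression maps them to a per-face happiness intensity. The landmarks, labels and number of PLS components below are placeholders, not the paper's settings.

```python
# Sketch of the geometric-feature + PLS idea: pairwise distances between 49
# facial landmarks are used as features for Partial Least Squares regression of
# per-face happiness intensity. Landmarks and labels are random placeholders.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
n_faces, n_points = 200, 49

def geometric_features(landmarks):
    """All pairwise inter-landmark distances for one face (49*48/2 = 1176 values)."""
    return pdist(landmarks)

landmarks = rng.normal(size=(n_faces, n_points, 2))      # placeholder (x, y) points
X = np.stack([geometric_features(l) for l in landmarks])
y = rng.uniform(0, 5, size=n_faces)                      # happiness intensity 0-5

pls = PLSRegression(n_components=10)
pls.fit(X, y)
print("predicted intensities:", pls.predict(X[:3]).ravel())
```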

Happiness Level Prediction with Sequential Inputs via Multiple Regressions
Jianshu Li, Sujoy Roy, Jiashi Feng, and Terence Sim
(National University of Singapore, Singapore; SAP, Singapore)
This paper presents our solution submitted to the Emotion Recognition in the Wild (EmotiW 2016) group-level happiness intensity prediction sub-challenge. The objective of this sub-challenge is to predict the overall happiness level given an image of a group of people in a natural setting. We note that both the global setting and the faces of the individuals in the image influence its group-level happiness intensity. Hence the challenge lies in building a solution that incorporates both factors and combines them appropriately. Our proposed solution incorporates both factors as a combination of global and local information. We use a convolutional neural network to extract discriminative face features, and a recurrent neural network to selectively memorize the important features for the group-level happiness prediction task. Experimental evaluations show promising performance improvements, with a Root Mean Square Error (RMSE) reduction of about 0.5 on the test set compared to the baseline algorithm, which uses only global information.

Video Emotion Recognition in the Wild Based on Fusion of Multimodal Features
Shizhe Chen, Xinrui Li, Qin Jin, Shilei Zhang, and Yong Qin
(Renmin University of China, China; IBM Research, China)
In this paper, we present our methods for the Audio-Video Based Emotion Recognition sub-task of the 2016 Emotion Recognition in the Wild (EmotiW) Challenge. The task is to predict one of the seven basic emotions for the characters in video clips extracted from movies or TV shows. In our approach, we explore various multimodal features from the audio, facial image and video motion modalities. The audio features comprise statistical acoustic features, MFCC Bag-of-Audio-Words and MFCC Fisher Vectors. For image-related features, we extract hand-crafted features (LBP-TOP and SPM Dense SIFT) and learned features (CNN features). Improved Dense Trajectories are used as the motion-related features. We train SVM, Random Forest and Logistic Regression classifiers for each kind of feature. Among them, the MFCC Fisher vector is the best acoustic feature and the facial CNN feature is the most discriminative feature for emotion recognition. We use late fusion to combine the different modality features and achieve a 50.76% accuracy on the test set, which significantly outperforms the baseline test accuracy of 40.47%.
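The MFCC Bag-of-Audio-Words encoding can be sketched as follows: frame-level MFCC vectors are clustered into a codebook, and each clip becomes a normalized histogram of codeword assignments. The synthetic signals, codebook size and MFCC settings below are placeholders, not the paper's configuration.

```python
# Sketch of an MFCC Bag-of-Audio-Words encoding: frame-level MFCCs are clustered
# into a codebook, and each clip is represented by a normalized histogram of
# codeword assignments (synthetic audio stands in for the real clips).
import numpy as np
import librosa
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
sr, n_mfcc, codebook_size = 16000, 13, 32

def clip_mfcc(signal):
    # Frame-level MFCC vectors, shape (n_frames, n_mfcc).
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T

clips = [rng.normal(size=sr * 2).astype(np.float32) for _ in range(10)]  # 2 s each
frames = [clip_mfcc(c) for c in clips]

# Codebook learned from all training frames.
kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=0)
kmeans.fit(np.vstack(frames))

def bag_of_audio_words(frame_mfccs):
    counts = np.bincount(kmeans.predict(frame_mfccs), minlength=codebook_size)
    return counts / counts.sum()

X = np.stack([bag_of_audio_words(f) for f in frames])   # one histogram per clip
print(X.shape)                                           # (10, 32)
```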

Wild Wild Emotion: A Multimodal Ensemble Approach
John Gideon, Biqiao Zhang, Zakaria Aldeneh, Yelin Kim, Soheil Khorram, Duc Le, and Emily Mower Provost
(University of Michigan, USA; SUNY Albany, USA)
Automatic emotion recognition from audio-visual data is a topic that has been broadly explored using data captured in the laboratory. However, these data are not necessarily representative of how emotion is manifested in the real world. In this paper, we describe our system for the 2016 Emotion Recognition in the Wild challenge. We use the Acted Facial Expressions in the Wild database 6.0 (AFEW 6.0), which contains short clips of popular TV shows and movies and has more variability in the data compared to laboratory recordings. We explore a set of features that incorporate information from facial expressions and speech, in addition to cues from the background music and overall scene. In particular, we propose the use of a feature set composed of dimensional emotion estimates trained from outside acoustic corpora. We design sets of multiclass and pairwise (one-versus-one) classifiers and fuse the resulting systems. Our fusion increases the performance from a baseline of 38.81% to 43.86% and from 40.47% to 46.88%, for the validation and test sets, respectively. While the video features perform better than audio features alone, a combination of the two modalities achieves the greatest performance, with gains of 4.4% and 1.4%, with and without information gain, respectively. Because of the flexible design of the fusion, it is easily adaptable to other multimodal learning problems.
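A sketch of combining a multiclass classifier with a pairwise (one-versus-one) ensemble: both are trained on the same features and their normalized class scores are averaged. The features and the specific classifiers below are placeholders; the paper's fusion scheme is more elaborate than this.

```python
# Sketch of fusing a multiclass classifier with a pairwise (one-vs-one) ensemble:
# both are trained on the same placeholder features and their normalized class
# scores are averaged for the final prediction.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsOneClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
n, dim, n_classes = 400, 64, 7
X, y = rng.normal(size=(n, dim)), rng.integers(0, n_classes, size=n)

multiclass = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
pairwise = OneVsOneClassifier(LinearSVC()).fit(X, y)

def normalize(scores):
    """Shift and scale per-sample scores so they sum to one, like probabilities."""
    scores = scores - scores.min(axis=1, keepdims=True)
    return scores / scores.sum(axis=1, keepdims=True)

fused = 0.5 * multiclass.predict_proba(X) + 0.5 * normalize(pairwise.decision_function(X))
print("fused predictions:", fused.argmax(axis=1)[:10])
```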

Audio and Face Video Emotion Recognition in the Wild using Deep Neural Networks and Small Datasets
Wan Ding, Mingyu Xu, Dongyan Huang, Weisi Lin, Minghui Dong, Xinguo Yu, and Haizhou Li
(Central China Normal University, China; University of British Columbia, Canada; A*STAR, Singapore; Nanyang Technological University, Singapore; National University of Singapore, Singapore)
This paper presents the techniques used in our contribution to the video-based sub-challenge of Emotion Recognition in the Wild 2016. The purpose of the sub-challenge is to classify the six basic emotions (anger, sadness, happiness, surprise, fear and disgust) and neutral. Compared to the movie-based datasets of earlier years, this year's test dataset introduced reality TV videos containing more spontaneous emotion. Our proposed solution is the fusion of facial expression recognition and audio emotion recognition subsystems at the score level. For facial emotion recognition, starting from a network pre-trained on ImageNet training data, a deep Convolutional Neural Network is fine-tuned on FER2013 training data for feature extraction. Several classifiers, i.e., kernel SVM, logistic regression and partial least squares, are studied for comparison. An optimal fusion of classifiers learned from different kernels is carried out at the score level to improve system performance. For audio emotion recognition, a deep Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) is trained directly on the challenge dataset. Experimental results show that both subsystems, individually and as a whole, achieve state-of-the-art performance. The overall accuracy of the proposed approach on the challenge test dataset is 53.9%, which is better than the challenge baseline of 40.47%.
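The fine-tuning step can be sketched as follows: an ImageNet-pretrained backbone has its final layer replaced by a 7-way classifier and is tuned on expression data, after which its penultimate activations can serve as face features for the downstream classifiers. ResNet-18 is used here purely as a stand-in backbone, and the face crops and labels are placeholders, not FER2013 data.

```python
# Sketch of fine-tuning an ImageNet-pretrained CNN for 7-way facial-expression
# classification (ResNet-18 is only a stand-in backbone, not the paper's network).
# The final layer is replaced and the whole network is tuned.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 7)        # 7 expression classes

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on placeholder face crops.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 7, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("step loss:", loss.item())

# After fine-tuning, penultimate-layer activations can feed downstream
# classifiers such as kernel SVM, logistic regression or PLS.
```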

Automatic Emotion Recognition in the Wild using an Ensemble of Static and Dynamic Representations
Mostafa Mehdipour Ghazi and Hazım Kemal Ekenel
(Sabanci University, Turkey; Istanbul Technical University, Turkey)
Automatic emotion recognition on in-the-wild video datasets is a very challenging problem because of the inter-class similarities among different facial expressions and the large intra-class variability due to significant changes in illumination, pose, scene, and expression. In this paper, we present our proposed method for video-based emotion recognition in the EmotiW 2016 challenge. The task considers the unconstrained emotion recognition problem by training on short video clips extracted from movies and testing on short movie clips and spontaneous video clips from reality TV data. Four different methods are employed to extract both static and dynamic emotion representations from the videos. First, local binary patterns of three orthogonal planes are used to describe spatiotemporal features of the video frames. Second, principal component analysis is applied to image patches in a two-stage convolutional network to learn weights and extract facial features from the aligned faces. Third, the deep convolutional neural network model of VGG-Face is deployed to extract deep facial representations from the aligned faces. Fourth, a bag of visual words is computed from dense scale-invariant feature transform descriptors of the aligned face images to form hand-crafted representations. Support vector machines are then used to train on and classify the obtained spatiotemporal representations and facial features. Finally, score-level fusion is applied to combine the classification results and predict the emotion labels of the video clips. The results show that the proposed combined method outperforms each of the individual techniques, with overall validation and test accuracies of 43.13% and 40.13%, respectively. The system is a relatively good classifier for the Happy and Angry emotion categories but is unsuccessful in detecting Surprise, Disgust, and Fear.
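The dense-SIFT bag-of-visual-words representation can be sketched as follows: SIFT descriptors are computed on a regular keypoint grid over each aligned face, clustered into a codebook, and histogrammed per face. The grid spacing, codebook size and images below are placeholders, not the paper's settings.

```python
# Sketch of a dense-SIFT bag-of-visual-words face representation: SIFT
# descriptors on a regular keypoint grid over the aligned face are clustered
# into a codebook, and each face becomes a histogram of codeword assignments
# (random images stand in for aligned face crops).
import numpy as np
import cv2
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
sift = cv2.SIFT_create()
grid = [cv2.KeyPoint(float(x), float(y), 8.0)
        for x in range(8, 120, 8) for y in range(8, 120, 8)]

def dense_sift(face_gray):
    _, desc = sift.compute(face_gray, grid)
    return desc                              # (n_keypoints, 128) descriptors

faces = [rng.integers(0, 256, size=(128, 128), dtype=np.uint8) for _ in range(10)]
descs = [dense_sift(f) for f in faces]

codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(np.vstack(descs))

def bovw_histogram(desc):
    counts = np.bincount(codebook.predict(desc), minlength=64)
    return counts / counts.sum()

X = np.stack([bovw_histogram(d) for d in descs])   # one descriptor per face
print(X.shape)                                     # (10, 64)
```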
