ICMI 2017 – Author Index |
Abbas, Asad |
ICMI '17-EMOTIW: "Group Emotion Recognition ..."
Group Emotion Recognition in the Wild by Combining Deep Neural Networks for Facial Expression Classification and Scene-Context Analysis
Asad Abbas and Stephan K. Chalup (University of Newcastle, Australia) This paper presents the implementation details of a proposed solution to the Emotion Recognition in the Wild 2017 Challenge, in the category of group-level emotion recognition. The objective of this sub-challenge is to classify a group's emotion as Positive, Neutral or Negative. Our proposed approach incorporates both image context and facial information extracted from an image for classification. We use Convolutional Neural Networks (CNNs) to predict facial emotions from detected faces present in an image. Predicted facial emotions are combined with scene-context information extracted by another CNN using fully connected neural network layers. Various techniques for combining and training these two deep neural network models are explored in order to perform group-level emotion recognition. We evaluate our approach on the Group Affect Database 2.0 provided with the challenge. Experimental evaluations show promising results, with an improvement of approximately 37% over the competition's baseline model on the validation dataset. @InProceedings{ICMI17p561, author = {Asad Abbas and Stephan K. Chalup}, title = {Group Emotion Recognition in the Wild by Combining Deep Neural Networks for Facial Expression Classification and Scene-Context Analysis}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {561--568}, doi = {}, year = {2017}, } |
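The two-branch fusion described in this entry can be pictured with a minimal sketch (a hypothetical illustration, not the authors' implementation; the class name, dimensions, and layer sizes are assumptions):

import torch
import torch.nn as nn

class GroupEmotionFusion(nn.Module):
    # Hypothetical fusion head: per-face emotion probabilities from one CNN are
    # pooled and concatenated with scene-context features from a second CNN.
    def __init__(self, face_dim=7, scene_dim=512, n_classes=3):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(face_dim + scene_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, face_probs, scene_feat):
        # face_probs: (batch, n_faces, face_dim) softmax outputs per detected face
        # scene_feat: (batch, scene_dim) activations from the scene-context CNN
        pooled = face_probs.mean(dim=1)  # average over the faces in each image
        return self.fc(torch.cat([pooled, scene_feat], dim=1))

logits = GroupEmotionFusion()(torch.rand(4, 6, 7), torch.rand(4, 512))
print(logits.shape)  # torch.Size([4, 3]) -> Positive / Neutral / Negative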
|
Ahmed, Olfa Ben |
ICMI '17-EMOTIW: "Modeling Multimodal Cues in ..."
Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild
Stefano Pini, Olfa Ben Ahmed, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, and Benoit Huet (University of Modena and Reggio Emilia, Italy; EURECOM, France) In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third consists of a pretrained audio network used to extract deep acoustic features from the video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve accuracies of 50.39% and 49.92% on the validation and testing data, respectively. @InProceedings{ICMI17p536, author = {Stefano Pini and Olfa Ben Ahmed and Marcella Cornia and Lorenzo Baraldi and Rita Cucchiara and Benoit Huet}, title = {Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {536--543}, doi = {}, year = {2017}, } |
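A minimal sketch of the 3D-CNN sub-network idea from this entry, with stacked frames in and a clip-level prediction out (an illustration under assumed shapes, not the paper's architecture):

import torch
import torch.nn as nn

class Motion3DBranch(nn.Module):
    # Toy 3D-convolutional branch for motion patterns over time.
    def __init__(self, n_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(16, n_classes)

    def forward(self, clip):  # clip: (batch, channels, time, height, width)
        return self.head(self.conv(clip).flatten(1))

print(Motion3DBranch()(torch.rand(2, 3, 8, 32, 32)).shape)  # torch.Size([2, 7])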
|
Balaji, B. |
ICMI '17-EMOTIW: "Multi-Level Feature Fusion ..."
Multi-Level Feature Fusion for Group-Level Emotion Recognition
B. Balaji and V. Ramana Murthy Oruganti (Amrita University at Coimbatore, India) In this paper, the influence of low-level and mid-level features is investigated for image-based group emotion recognition. We hypothesize that human faces and the objects surrounding them are major sources of information and can thus serve as mid-level features. Hence, we detect faces and objects using pre-trained Deep Net models. Information from different layers, in conjunction with different encoding techniques, is extensively investigated to obtain the richest feature vectors. The best result, a classification accuracy of 65.0% on the validation set, was submitted to the Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition sub-challenge. The best feature vector yielded 75.1% on the testing set. Post-competition, a few more experiments were performed and are included here as well. @InProceedings{ICMI17p583, author = {B. Balaji and V. Ramana Murthy Oruganti}, title = {Multi-Level Feature Fusion for Group-Level Emotion Recognition}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {583--586}, doi = {}, year = {2017}, } |
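The layer-wise feature fusion in this entry can be sketched as concatenating L2-normalized descriptors taken from different network layers and training a linear classifier on top (a toy example with random stand-in features; all sizes are assumptions):

import numpy as np
from sklearn.svm import LinearSVC

def fuse(feats_by_layer):
    # feats_by_layer: list of (n_samples, d_i) arrays, e.g. pooled activations
    # drawn from different layers of pre-trained detection networks.
    normed = [f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
              for f in feats_by_layer]
    return np.hstack(normed)

rng = np.random.default_rng(0)
X = fuse([rng.normal(size=(60, 256)), rng.normal(size=(60, 512))])
y = rng.integers(0, 3, size=60)  # Positive / Neutral / Negative labels
print(LinearSVC().fit(X, y).score(X, y))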
|
Baraldi, Lorenzo |
ICMI '17-EMOTIW: "Modeling Multimodal Cues in ..."
Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild
Stefano Pini, Olfa Ben Ahmed, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, and Benoit Huet (University of Modena and Reggio Emilia, Italy; EURECOM, France) In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third consists of a pretrained audio network used to extract deep acoustic features from the video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve accuracies of 50.39% and 49.92% on the validation and testing data, respectively. @InProceedings{ICMI17p536, author = {Stefano Pini and Olfa Ben Ahmed and Marcella Cornia and Lorenzo Baraldi and Rita Cucchiara and Benoit Huet}, title = {Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {536--543}, doi = {}, year = {2017}, } |
|
Barner, Kenneth E. |
ICMI '17-EMOTIW: "Group-Level Emotion Recognition ..."
Group-Level Emotion Recognition using Deep Models on Image Scene, Faces, and Skeletons
Xin Guo, Luisa F. Polanía, and Kenneth E. Barner (University of Delaware, USA; American Family Mutual Insurance Company, USA) This paper presents the work submitted to the Group-level Emotion Recognition sub-challenge, which is part of the 5th Emotion Recognition in the Wild (EmotiW 2017) Challenge. The task of this sub-challenge is to classify the emotion of a group of people in each image as positive, neutral or negative. To address this task, a hybrid network that incorporates global scene features, skeleton features of the group, and local facial features is developed. Specifically, deep convolutional neural networks (CNNs) are first trained on the faces of the group, the whole images, and the skeletons of the group, and are then fused to perform the group-level emotion prediction. Experimental results show that the proposed network achieves accuracies of 80.05% and 80.61% on the validation and testing sets, respectively, outperforming the baselines of 52.97% and 53.62%. @InProceedings{ICMI17p603, author = {Xin Guo and Luisa F. Polanía and Kenneth E. Barner}, title = {Group-Level Emotion Recognition using Deep Models on Image Scene, Faces, and Skeletons}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {603--608}, doi = {}, year = {2017}, } |
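One simple way to realize the branch fusion this entry describes is weighted late fusion of the per-branch softmax scores (the weights below are arbitrary illustrations, not the tuned values from the paper):

import numpy as np

def fuse_branches(scene, faces, skeleton, w=(0.4, 0.4, 0.2)):
    # each argument: (n_images, 3) softmax scores for positive/neutral/negative
    return (w[0] * scene + w[1] * faces + w[2] * skeleton).argmax(axis=1)

rng = np.random.default_rng(0)
s = rng.dirichlet(np.ones(3), size=5)  # stand-in scores from one branch
print(fuse_branches(s, s, s))          # fused class index per image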
|
Battini Sönmez, Elena |
ICMI '17-EMOTIW: "Emotion Recognition in the ..."
Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers
Luca Surace, Massimiliano Patacchiola, Elena Battini Sönmez, William Spataro, and Angelo Cangelosi (University of Calabria, Italy; Plymouth University, UK; Istanbul Bilgi University, Turkey) Group emotion recognition in the wild is a challenging problem, due to the unstructured environments in which everyday life pictures are taken. Some of the obstacles to effective classification are occlusions, variable lighting conditions, and image quality. In this work we present a solution based on a novel combination of deep neural networks and Bayesian classifiers. The neural network takes a bottom-up approach, analyzing emotions expressed by isolated faces. The Bayesian classifier estimates a global emotion by integrating top-down features obtained through a scene descriptor. To validate the system, we tested the framework on the dataset released for the Emotion Recognition in the Wild Challenge 2017. Our method achieved an accuracy of 64.68% on the test set, significantly outperforming the 53.62% competition baseline. @InProceedings{ICMI17p593, author = {Luca Surace and Massimiliano Patacchiola and Elena Battini Sönmez and William Spataro and Angelo Cangelosi}, title = {Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {593--597}, doi = {}, year = {2017}, } |
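The bottom-up/top-down combination in this entry can be illustrated as a Bayesian update: a prior over group emotions is multiplied by the likelihoods of the per-face predictions and of the scene descriptor (all probability tables below are made-up illustrations, not the paper's values):

import numpy as np

prior = np.array([1/3, 1/3, 1/3])       # Positive, Neutral, Negative
face_lik = np.array([[0.7, 0.2, 0.1],   # P(face observation | group emotion),
                     [0.6, 0.3, 0.1]])  # one row per detected face
scene_lik = np.array([0.5, 0.3, 0.2])   # P(scene descriptor | group emotion)

posterior = prior * scene_lik * face_lik.prod(axis=0)
posterior /= posterior.sum()
print(posterior)  # normalized belief over the group emotion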
|
Cai, Dongqi |
ICMI '17-EMOTIW: "Learning Supervised Scoring ..."
Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild
Ping Hu, Dongqi Cai, Shandong Wang, Anbang Yao, and Yurong Chen (Intel Labs, China) State-of-the-art approaches for the previous emotion recognition in the wild challenges are usually built on prevailing Convolutional Neural Networks (CNNs). Although there is clear evidence that CNNs with increased depth or width can usually bring improved prediction accuracy, existing top approaches provide supervision only at the output feature layer, resulting in insufficient training of deep CNN models. In this paper, we present a new learning method named Supervised Scoring Ensemble (SSE) for advancing this challenge with deep CNNs. We first extend the idea of recent deep supervision to deal with the emotion recognition problem. By adding supervision not only to deep layers but also to intermediate and shallow layers, the training of deep CNNs is considerably eased. Second, we present a new fusion structure in which class-wise scoring activations at diverse complementary feature layers are concatenated and further used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. We show that our proposed learning method consistently brings large accuracy gains across diverse backbone networks. On this year's audio-video based emotion recognition task, the average recognition rate of our best submission is 60.34%, forming a new envelope over all existing records. @InProceedings{ICMI17p553, author = {Ping Hu and Dongqi Cai and Shandong Wang and Anbang Yao and Yurong Chen}, title = {Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {553--560}, doi = {}, year = {2017}, } |
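A toy illustration of the supervised scoring ensemble idea described above: auxiliary score heads at shallow, intermediate, and deep stages, whose class-wise scores are concatenated for a second-level head. The stages and shapes are assumptions; in training, each returned score would receive its own loss:

import torch
import torch.nn as nn

class TinySSE(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(8))
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(4))
        self.stage3 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1))
        self.head1 = nn.Linear(16 * 8 * 8, n_classes)       # shallow supervision
        self.head2 = nn.Linear(32 * 4 * 4, n_classes)       # intermediate supervision
        self.head3 = nn.Linear(64, n_classes)               # deep supervision
        self.fusion = nn.Linear(3 * n_classes, n_classes)   # second-level supervision

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        s1 = self.head1(f1.flatten(1))
        s2 = self.head2(f2.flatten(1))
        s3 = self.head3(f3.flatten(1))
        return s1, s2, s3, self.fusion(torch.cat([s1, s2, s3], dim=1))

print([t.shape for t in TinySSE()(torch.rand(2, 3, 32, 32))])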
|
Cangelosi, Angelo |
ICMI '17-EMOTIW: "Emotion Recognition in the ..."
Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers
Luca Surace, Massimiliano Patacchiola, Elena Battini Sönmez, William Spataro, and Angelo Cangelosi (University of Calabria, Italy; Plymouth University, UK; Istanbul Bilgi University, Turkey) Group emotion recognition in the wild is a challenging problem, due to the unstructured environments in which everyday life pictures are taken. Some of the obstacles to effective classification are occlusions, variable lighting conditions, and image quality. In this work we present a solution based on a novel combination of deep neural networks and Bayesian classifiers. The neural network takes a bottom-up approach, analyzing emotions expressed by isolated faces. The Bayesian classifier estimates a global emotion by integrating top-down features obtained through a scene descriptor. To validate the system, we tested the framework on the dataset released for the Emotion Recognition in the Wild Challenge 2017. Our method achieved an accuracy of 64.68% on the test set, significantly outperforming the 53.62% competition baseline. @InProceedings{ICMI17p593, author = {Luca Surace and Massimiliano Patacchiola and Elena Battini Sönmez and William Spataro and Angelo Cangelosi}, title = {Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {593--597}, doi = {}, year = {2017}, } |
|
Chalup, Stephan K. |
ICMI '17-EMOTIW: "Group Emotion Recognition ..."
Group Emotion Recognition in the Wild by Combining Deep Neural Networks for Facial Expression Classification and Scene-Context Analysis
Asad Abbas and Stephan K. Chalup (University of Newcastle, Australia) This paper presents the implementation details of a proposed solution to the Emotion Recognition in the Wild 2017 Challenge, in the category of group-level emotion recognition. The objective of this sub-challenge is to classify a group's emotion as Positive, Neutral or Negative. Our proposed approach incorporates both image context and facial information extracted from an image for classification. We use Convolutional Neural Networks (CNNs) to predict facial emotions from detected faces present in an image. Predicted facial emotions are combined with scene-context information extracted by another CNN using fully connected neural network layers. Various techniques for combining and training these two deep neural network models are explored in order to perform group-level emotion recognition. We evaluate our approach on the Group Affect Database 2.0 provided with the challenge. Experimental evaluations show promising results, with an improvement of approximately 37% over the competition's baseline model on the validation dataset. @InProceedings{ICMI17p561, author = {Asad Abbas and Stephan K. Chalup}, title = {Group Emotion Recognition in the Wild by Combining Deep Neural Networks for Facial Expression Classification and Scene-Context Analysis}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {561--568}, doi = {}, year = {2017}, } |
|
Chen, Shizhe |
ICMI '17-EMOTIW: "Emotion Recognition with Multimodal ..."
Emotion Recognition with Multimodal Features and Temporal Models
Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin, Shilei Zhang, and Yong Qin (Renmin University of China, China; IBM Research Lab, China) This paper presents our methods for the Audio-Video Based Emotion Recognition subtask in the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task aims to predict one of the seven basic emotions for short video segments. We extract different features from the audio and facial expression modalities. We also explore a temporal LSTM model that takes frame-level facial features as input, which improves on the performance of the non-temporal model. With the fusion of the different modality features and the temporal model, we achieve a 58.5% accuracy on the testing set, which shows the effectiveness of our methods. @InProceedings{ICMI17p598, author = {Shuai Wang and Wenxuan Wang and Jinming Zhao and Shizhe Chen and Qin Jin and Shilei Zhang and Yong Qin}, title = {Emotion Recognition with Multimodal Features and Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {598--602}, doi = {}, year = {2017}, } |
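A hedged sketch of the temporal model mentioned in this entry: an LSTM consumes per-frame facial features and its last hidden state classifies the clip (the feature dimension and hidden size are illustrative assumptions):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=128, batch_first=True)
head = nn.Linear(128, 7)

frames = torch.rand(4, 16, 512)   # (batch, frames, facial-feature dimension)
_, (h_n, _) = lstm(frames)        # h_n[-1]: last hidden state per clip
print(head(h_n[-1]).shape)        # torch.Size([4, 7]) -> seven emotion classes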
|
Chen, Yurong |
ICMI '17-EMOTIW: "Learning Supervised Scoring ..."
Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild
Ping Hu, Dongqi Cai, Shandong Wang, Anbang Yao, and Yurong Chen (Intel Labs, China) State-of-the-art approaches for the previous emotion recognition in the wild challenges are usually built on prevailing Convolutional Neural Networks (CNNs). Although there is clear evidence that CNNs with increased depth or width can usually bring improved prediction accuracy, existing top approaches provide supervision only at the output feature layer, resulting in insufficient training of deep CNN models. In this paper, we present a new learning method named Supervised Scoring Ensemble (SSE) for advancing this challenge with deep CNNs. We first extend the idea of recent deep supervision to deal with the emotion recognition problem. By adding supervision not only to deep layers but also to intermediate and shallow layers, the training of deep CNNs is considerably eased. Second, we present a new fusion structure in which class-wise scoring activations at diverse complementary feature layers are concatenated and further used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. We show that our proposed learning method consistently brings large accuracy gains across diverse backbone networks. On this year's audio-video based emotion recognition task, the average recognition rate of our best submission is 60.34%, forming a new envelope over all existing records. @InProceedings{ICMI17p553, author = {Ping Hu and Dongqi Cai and Shandong Wang and Anbang Yao and Yurong Chen}, title = {Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {553--560}, doi = {}, year = {2017}, } |
|
Choi, Dong Yoon |
ICMI '17-EMOTIW: "Multi-modal Emotion Recognition ..."
Multi-modal Emotion Recognition using Semi-supervised Learning and Multiple Neural Networks in the Wild
Dae Ha Kim, Min Kyu Lee, Dong Yoon Choi, and Byung Cheol Song (Inha University, South Korea) Human emotion recognition is a research topic that receives continuous attention in the computer vision and artificial intelligence domains. This paper proposes a method for classifying human emotions through multiple neural networks based on multi-modal signals consisting of image, landmark, and audio in a wild environment. The proposed method has the following features. First, the learning performance of the image-based network is greatly improved by employing both multi-task learning and semi-supervised learning using the spatio-temporal characteristics of videos. Second, a new model is proposed for converting one-dimensional (1D) facial landmark information into two-dimensional (2D) images, and a CNN-LSTM network based on this model is proposed for better emotion recognition. Third, based on the observation that audio signals are often very effective for specific emotions, we propose an audio deep learning mechanism that is robust for those emotions. Finally, so-called emotion-adaptive fusion is applied to enable synergy among the multiple networks. On the fifth attempt on the given test set in the EmotiW 2017 challenge, the proposed method achieved a classification accuracy of 57.12%. @InProceedings{ICMI17p529, author = {Dae Ha Kim and Min Kyu Lee and Dong Yoon Choi and Byung Cheol Song}, title = {Multi-modal Emotion Recognition using Semi-supervised Learning and Multiple Neural Networks in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {529--535}, doi = {}, year = {2017}, } |
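The 1D-to-2D landmark conversion in this entry can be pictured as rasterizing landmark coordinates onto an image grid that a 2D CNN can consume (a rough stand-in for the paper's model, with made-up sizes):

import numpy as np

def landmarks_to_image(landmarks, size=64):
    # landmarks: (n_points, 2) (x, y) coordinates normalized to [0, 1]
    img = np.zeros((size, size), dtype=np.float32)
    pts = np.clip((landmarks * (size - 1)).astype(int), 0, size - 1)
    img[pts[:, 1], pts[:, 0]] = 1.0  # mark each landmark pixel
    return img

img = landmarks_to_image(np.random.rand(68, 2))
print(img.shape, int(img.sum()))  # (64, 64); marked pixels (<= 68 due to collisions)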
|
Cornia, Marcella |
ICMI '17-EMOTIW: "Modeling Multimodal Cues in ..."
Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild
Stefano Pini, Olfa Ben Ahmed, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, and Benoit Huet (University of Modena and Reggio Emilia, Italy; EURECOM, France) In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third consists of a pretrained audio network used to extract deep acoustic features from the video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve accuracies of 50.39% and 49.92% on the validation and testing data, respectively. @InProceedings{ICMI17p536, author = {Stefano Pini and Olfa Ben Ahmed and Marcella Cornia and Lorenzo Baraldi and Rita Cucchiara and Benoit Huet}, title = {Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {536--543}, doi = {}, year = {2017}, } |
|
Cucchiara, Rita |
ICMI '17-EMOTIW: "Modeling Multimodal Cues in ..."
Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild
Stefano Pini, Olfa Ben Ahmed, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, and Benoit Huet (University of Modena and Reggio Emilia, Italy; EURECOM, France) In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third consists of a pretrained audio network used to extract deep acoustic features from the video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve accuracies of 50.39% and 49.92% on the validation and testing data, respectively. @InProceedings{ICMI17p536, author = {Stefano Pini and Olfa Ben Ahmed and Marcella Cornia and Lorenzo Baraldi and Rita Cucchiara and Benoit Huet}, title = {Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {536--543}, doi = {}, year = {2017}, } |
|
Dhall, Abhinav |
ICMI '17-EMOTIW: "From Individual to Group-Level ..."
From Individual to Group-Level Emotion Recognition: EmotiW 5.0
Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon (IIT Ropar, India; University of Canberra, Australia; University of Waterloo, Canada; Australian National University, Australia) Research in automatic affect recognition has come a long way. This paper describes the fifth Emotion Recognition in the Wild (EmotiW) challenge, held in 2017. EmotiW aims to provide a common benchmarking platform for researchers working on different aspects of affective computing. This year there are two sub-challenges: a) audio-video emotion recognition and b) group-level emotion recognition. These challenges are based on the Acted Facial Expressions in the Wild and Group Affect databases, respectively. The particular focus of the challenge is to evaluate methods in 'in the wild' settings. 'In the wild' here describes the various environments represented in the images and videos, which depict real-world (not lab-like) scenarios. The baselines, data, and protocols of the two challenges, as well as the challenge participation, are discussed in detail in this paper. @InProceedings{ICMI17p524, author = {Abhinav Dhall and Roland Goecke and Shreya Ghosh and Jyoti Joshi and Jesse Hoey and Tom Gedeon}, title = {From Individual to Group-Level Emotion Recognition: EmotiW 5.0}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {524--528}, doi = {}, year = {2017}, } |
|
Ding, Wan |
ICMI '17-EMOTIW: "Audio-Visual Emotion Recognition ..."
Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models
Xi Ouyang, Shigenori Kawaai, Ester Gue Hua Goh, Shengmei Shen, Wan Ding, Huaiping Ming, and Dong-Yan Huang (Panasonic R&D Center, Singapore; Central China Normal University, China; Institute for Infocomm Research at A*STAR, Singapore) This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear, and disgust) and neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenges of in-the-wild emotion recognition. Deep network transfer learning is used for feature extraction. Spatial-temporal model fusion makes full use of the complementarity of different networks. Semi-automatic reinforcement learning optimizes the fusion strategy based on dynamic outside feedback given by the challenge organizers. The overall accuracy of the proposed approach on the challenge test dataset is 57.2%, which is better than the challenge baseline of 40.47%. @InProceedings{ICMI17p577, author = {Xi Ouyang and Shigenori Kawaai and Ester Gue Hua Goh and Shengmei Shen and Wan Ding and Huaiping Ming and Dong-Yan Huang}, title = {Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {577--582}, doi = {}, year = {2017}, } |
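Transfer-learning feature extraction of the kind this entry describes can be sketched by re-using a pretrained backbone with its classification head removed. The authors' actual networks and weights differ; torchvision's ResNet-18 serves purely as a stand-in here (weights=None keeps the sketch offline and assumes the torchvision >= 0.13 API):

import torch
import torchvision.models as models

backbone = models.resnet18(weights=None)  # in practice, load pretrained weights
backbone.fc = torch.nn.Identity()         # drop the classification head
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.rand(2, 3, 224, 224))  # one 512-d descriptor per frame
print(feats.shape)  # torch.Size([2, 512])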
|
Gedeon, Tom |
ICMI '17-EMOTIW: "From Individual to Group-Level ..."
From Individual to Group-Level Emotion Recognition: EmotiW 5.0
Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon (IIT Ropar, India; University of Canberra, Australia; University of Waterloo, Canada; Australian National University, Australia) Research in automatic affect recognition has come a long way. This paper describes the fifth Emotion Recognition in the Wild (EmotiW) challenge, held in 2017. EmotiW aims to provide a common benchmarking platform for researchers working on different aspects of affective computing. This year there are two sub-challenges: a) audio-video emotion recognition and b) group-level emotion recognition. These challenges are based on the Acted Facial Expressions in the Wild and Group Affect databases, respectively. The particular focus of the challenge is to evaluate methods in 'in the wild' settings. 'In the wild' here describes the various environments represented in the images and videos, which depict real-world (not lab-like) scenarios. The baselines, data, and protocols of the two challenges, as well as the challenge participation, are discussed in detail in this paper. @InProceedings{ICMI17p524, author = {Abhinav Dhall and Roland Goecke and Shreya Ghosh and Jyoti Joshi and Jesse Hoey and Tom Gedeon}, title = {From Individual to Group-Level Emotion Recognition: EmotiW 5.0}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {524--528}, doi = {}, year = {2017}, } |
|
Ghosh, Shreya |
ICMI '17-EMOTIW: "From Individual to Group-Level ..."
From Individual to Group-Level Emotion Recognition: EmotiW 5.0
Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon (IIT Ropar, India; University of Canberra, Australia; University of Waterloo, Canada; Australian National University, Australia) Research in automatic affect recognition has come a long way. This paper describes the fifth Emotion Recognition in the Wild (EmotiW) challenge, held in 2017. EmotiW aims to provide a common benchmarking platform for researchers working on different aspects of affective computing. This year there are two sub-challenges: a) audio-video emotion recognition and b) group-level emotion recognition. These challenges are based on the Acted Facial Expressions in the Wild and Group Affect databases, respectively. The particular focus of the challenge is to evaluate methods in 'in the wild' settings. 'In the wild' here describes the various environments represented in the images and videos, which depict real-world (not lab-like) scenarios. The baselines, data, and protocols of the two challenges, as well as the challenge participation, are discussed in detail in this paper. @InProceedings{ICMI17p524, author = {Abhinav Dhall and Roland Goecke and Shreya Ghosh and Jyoti Joshi and Jesse Hoey and Tom Gedeon}, title = {From Individual to Group-Level Emotion Recognition: EmotiW 5.0}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {524--528}, doi = {}, year = {2017}, } |
|
Goecke, Roland |
ICMI '17-EMOTIW: "From Individual to Group-Level ..."
From Individual to Group-Level Emotion Recognition: EmotiW 5.0
Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon (IIT Ropar, India; University of Canberra, Australia; University of Waterloo, Canada; Australian National University, Australia) Research in automatic affect recognition has come a long way. This paper describes the fifth Emotion Recognition in the Wild (EmotiW) challenge, held in 2017. EmotiW aims to provide a common benchmarking platform for researchers working on different aspects of affective computing. This year there are two sub-challenges: a) audio-video emotion recognition and b) group-level emotion recognition. These challenges are based on the Acted Facial Expressions in the Wild and Group Affect databases, respectively. The particular focus of the challenge is to evaluate methods in 'in the wild' settings. 'In the wild' here describes the various environments represented in the images and videos, which depict real-world (not lab-like) scenarios. The baselines, data, and protocols of the two challenges, as well as the challenge participation, are discussed in detail in this paper. @InProceedings{ICMI17p524, author = {Abhinav Dhall and Roland Goecke and Shreya Ghosh and Jyoti Joshi and Jesse Hoey and Tom Gedeon}, title = {From Individual to Group-Level Emotion Recognition: EmotiW 5.0}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {524--528}, doi = {}, year = {2017}, } |
|
Goh, Ester Gue Hua |
ICMI '17-EMOTIW: "Audio-Visual Emotion Recognition ..."
Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models
Xi Ouyang, Shigenori Kawaai, Ester Gue Hua Goh, Shengmei Shen, Wan Ding, Huaiping Ming, and Dong-Yan Huang (Panasonic R&D Center, Singapore; Central China Normal University, China; Institute for Infocomm Research at A*STAR, Singapore) This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear, and disgust) and neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenges of in-the-wild emotion recognition. Deep network transfer learning is used for feature extraction. Spatial-temporal model fusion makes full use of the complementarity of different networks. Semi-automatic reinforcement learning optimizes the fusion strategy based on dynamic outside feedback given by the challenge organizers. The overall accuracy of the proposed approach on the challenge test dataset is 57.2%, which is better than the challenge baseline of 40.47%. @InProceedings{ICMI17p577, author = {Xi Ouyang and Shigenori Kawaai and Ester Gue Hua Goh and Shengmei Shen and Wan Ding and Huaiping Ming and Dong-Yan Huang}, title = {Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {577--582}, doi = {}, year = {2017}, } |
|
Gruzdev, Alexey |
ICMI '17-EMOTIW: "Group-Level Emotion Recognition ..."
Group-Level Emotion Recognition using Transfer Learning from Face Identification
Alexandr Rassadin, Alexey Gruzdev, and Andrey Savchenko (National Research University Higher School of Economics, Russia) In this paper, we describe our algorithmic approach, which was used for submissions to the fifth Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition sub-challenge. We extracted feature vectors of detected faces using a Convolutional Neural Network trained for the face identification task, rather than the traditional pre-training on emotion recognition problems. In the final pipeline, an ensemble of Random Forest classifiers was trained to predict the emotion score using the available training set. In cases where no faces have been detected, one member of our ensemble extracts features from the whole image. During our experimental study, the proposed approach showed the lowest error rate when compared to the other explored techniques. In particular, we achieved 75.4% accuracy on the validation data, which is 20% higher than the handcrafted-feature baseline. The source code, using the Keras framework, is publicly available. @InProceedings{ICMI17p544, author = {Alexandr Rassadin and Alexey Gruzdev and Andrey Savchenko}, title = {Group-Level Emotion Recognition using Transfer Learning from Face Identification}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {544--548}, doi = {}, year = {2017}, } |
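The final stage this entry describes, random forests over face-identification embeddings, can be sketched with scikit-learn (random stand-in features; the CNN embedding step is omitted, and the paper's published code uses Keras for that part):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(120, 256))   # stand-in face-identification embeddings
y_train = rng.integers(0, 3, size=120)  # Positive / Neutral / Negative

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(forest.predict_proba(rng.normal(size=(5, 256))).shape)  # (5, 3) emotion scores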
|
Guo, Xin |
ICMI '17-EMOTIW: "Group-Level Emotion Recognition ..."
Group-Level Emotion Recognition using Deep Models on Image Scene, Faces, and Skeletons
Xin Guo, Luisa F. Polanía, and Kenneth E. Barner (University of Delaware, USA; American Family Mutual Insurance Company, USA) This paper presents the work submitted to the Group-level Emotion Recognition sub-challenge, which is part of the 5th Emotion Recognition in the Wild (EmotiW 2017) Challenge. The task of this sub-challenge is to classify the emotion of a group of people in each image as positive, neutral or negative. To address this task, a hybrid network that incorporates global scene features, skeleton features of the group, and local facial features is developed. Specifically, deep convolutional neural networks (CNNs) are first trained on the faces of the group, the whole images, and the skeletons of the group, and are then fused to perform the group-level emotion prediction. Experimental results show that the proposed network achieves accuracies of 80.05% and 80.61% on the validation and testing sets, respectively, outperforming the baselines of 52.97% and 53.62%. @InProceedings{ICMI17p603, author = {Xin Guo and Luisa F. Polanía and Kenneth E. Barner}, title = {Group-Level Emotion Recognition using Deep Models on Image Scene, Faces, and Skeletons}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {603--608}, doi = {}, year = {2017}, } |
|
He, Jun |
ICMI '17-EMOTIW: "A New Deep-Learning Framework ..."
A New Deep-Learning Framework for Group Emotion Recognition
Qinglan Wei, Yijia Zhao, Qihua Xu, Liandong Li, Jun He, Lejun Yu, and Bo Sun (Beijing Normal University, China) In this paper, we target the group-level emotion recognition sub-challenge of the fifth Emotion Recognition in the Wild (EmotiW 2017) Challenge, which is based on the Group Affect Database 2.0 containing images of groups of people in a wide variety of social events. We use SeetaFace to detect and align the faces in the group images and extract two kinds of face-image visual features: VGGFace-LSTM and DCNN-LSTM. As group image features, we propose using Pyramid Histogram of Oriented Gradients (PHOG), CENTRIST, DCNN, and VGG features. For the testing group images in which faces have been detected, the final emotion is estimated using both group image features and face-level visual features, while for those in which no faces can be detected, only the group image features are fused for final recognition. Our final result is an accuracy of 79.78% on the Group Affect Database 2.0 testing set, which is much higher than the corresponding baseline result of 53.62%. @InProceedings{ICMI17p587, author = {Qinglan Wei and Yijia Zhao and Qihua Xu and Liandong Li and Jun He and Lejun Yu and Bo Sun}, title = {A New Deep-Learning Framework for Group Emotion Recognition}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {587--592}, doi = {}, year = {2017}, } |
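A very rough stand-in for the pyramid-of-HOG (PHOG) group-image descriptor named in this entry: HOG histograms computed on the whole image and on its four quadrants, then concatenated (this simplified two-level pyramid is an assumption, not the exact PHOG of the paper):

import numpy as np
from skimage.feature import hog

def pyramid_hog(img):
    # level 0: whole image; level 1: the four quadrants
    h, w = img.shape
    blocks = [img] + [img[i*h//2:(i+1)*h//2, j*w//2:(j+1)*w//2]
                      for i in range(2) for j in range(2)]
    return np.concatenate([
        hog(b, orientations=8,
            pixels_per_cell=(b.shape[0] // 4, b.shape[1] // 4),
            cells_per_block=(1, 1))
        for b in blocks])

print(pyramid_hog(np.random.rand(64, 64)).shape)  # concatenated descriptor length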
|
Hoey, Jesse |
ICMI '17-EMOTIW: "From Individual to Group-Level ..."
From Individual to Group-Level Emotion Recognition: EmotiW 5.0
Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon (IIT Ropar, India; University of Canberra, Australia; University of Waterloo, Canada; Australian National University, Australia) Research in automatic affect recognition has come a long way. This paper describes the fifth Emotion Recognition in the Wild (EmotiW) challenge, held in 2017. EmotiW aims to provide a common benchmarking platform for researchers working on different aspects of affective computing. This year there are two sub-challenges: a) audio-video emotion recognition and b) group-level emotion recognition. These challenges are based on the Acted Facial Expressions in the Wild and Group Affect databases, respectively. The particular focus of the challenge is to evaluate methods in 'in the wild' settings. 'In the wild' here describes the various environments represented in the images and videos, which depict real-world (not lab-like) scenarios. The baselines, data, and protocols of the two challenges, as well as the challenge participation, are discussed in detail in this paper. @InProceedings{ICMI17p524, author = {Abhinav Dhall and Roland Goecke and Shreya Ghosh and Jyoti Joshi and Jesse Hoey and Tom Gedeon}, title = {From Individual to Group-Level Emotion Recognition: EmotiW 5.0}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {524--528}, doi = {}, year = {2017}, } |
|
Hu, Ping |
ICMI '17-EMOTIW: "Learning Supervised Scoring ..."
Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild
Ping Hu, Dongqi Cai, Shandong Wang, Anbang Yao, and Yurong Chen (Intel Labs, China) State-of-the-art approaches for the previous emotion recognition in the wild challenges are usually built on prevailing Convolutional Neural Networks (CNNs). Although there is clear evidence that CNNs with increased depth or width can usually bring improved prediction accuracy, existing top approaches provide supervision only at the output feature layer, resulting in insufficient training of deep CNN models. In this paper, we present a new learning method named Supervised Scoring Ensemble (SSE) for advancing this challenge with deep CNNs. We first extend the idea of recent deep supervision to deal with the emotion recognition problem. By adding supervision not only to deep layers but also to intermediate and shallow layers, the training of deep CNNs is considerably eased. Second, we present a new fusion structure in which class-wise scoring activations at diverse complementary feature layers are concatenated and further used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. We show that our proposed learning method consistently brings large accuracy gains across diverse backbone networks. On this year's audio-video based emotion recognition task, the average recognition rate of our best submission is 60.34%, forming a new envelope over all existing records. @InProceedings{ICMI17p553, author = {Ping Hu and Dongqi Cai and Shandong Wang and Anbang Yao and Yurong Chen}, title = {Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {553--560}, doi = {}, year = {2017}, } |
|
Huang, Dong-Yan |
ICMI '17-EMOTIW: "Audio-Visual Emotion Recognition ..."
Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models
Xi Ouyang, Shigenori Kawaai, Ester Gue Hua Goh, Shengmei Shen, Wan Ding, Huaiping Ming, and Dong-Yan Huang (Panasonic R&D Center, Singapore; Central China Normal University, China; Institute for Infocomm Research at A*STAR, Singapore) This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear, and disgust) and neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenges of in-the-wild emotion recognition. Deep network transfer learning is used for feature extraction. Spatial-temporal model fusion makes full use of the complementarity of different networks. Semi-automatic reinforcement learning optimizes the fusion strategy based on dynamic outside feedback given by the challenge organizers. The overall accuracy of the proposed approach on the challenge test dataset is 57.2%, which is better than the challenge baseline of 40.47%. @InProceedings{ICMI17p577, author = {Xi Ouyang and Shigenori Kawaai and Ester Gue Hua Goh and Shengmei Shen and Wan Ding and Huaiping Ming and Dong-Yan Huang}, title = {Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {577--582}, doi = {}, year = {2017}, } |
|
Huet, Benoit |
ICMI '17-EMOTIW: "Modeling Multimodal Cues in ..."
Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild
Stefano Pini, Olfa Ben Ahmed, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, and Benoit Huet (University of Modena and Reggio Emilia, Italy; EURECOM, France) In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third consists of a pretrained audio network used to extract deep acoustic features from the video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve accuracies of 50.39% and 49.92% on the validation and testing data, respectively. @InProceedings{ICMI17p536, author = {Stefano Pini and Olfa Ben Ahmed and Marcella Cornia and Lorenzo Baraldi and Rita Cucchiara and Benoit Huet}, title = {Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {536--543}, doi = {}, year = {2017}, } |
|
Jin, Qin |
ICMI '17-EMOTIW: "Emotion Recognition with Multimodal ..."
Emotion Recognition with Multimodal Features and Temporal Models
Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin, Shilei Zhang, and Yong Qin (Renmin University of China, China; IBM Research Lab, China) This paper presents our methods for the Audio-Video Based Emotion Recognition subtask in the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task aims to predict one of the seven basic emotions for short video segments. We extract different features from the audio and facial expression modalities. We also explore a temporal LSTM model that takes frame-level facial features as input, which improves on the performance of the non-temporal model. With the fusion of the different modality features and the temporal model, we achieve a 58.5% accuracy on the testing set, which shows the effectiveness of our methods. @InProceedings{ICMI17p598, author = {Shuai Wang and Wenxuan Wang and Jinming Zhao and Shizhe Chen and Qin Jin and Shilei Zhang and Yong Qin}, title = {Emotion Recognition with Multimodal Features and Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {598--602}, doi = {}, year = {2017}, } |
|
Joshi, Jyoti |
ICMI '17-EMOTIW: "From Individual to Group-Level ..."
From Individual to Group-Level Emotion Recognition: EmotiW 5.0
Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon (IIT Ropar, India; University of Canberra, Australia; University of Waterloo, Canada; Australian National University, Australia) Research in automatic affect recognition has come a long way. This paper describes the fifth Emotion Recognition in the Wild (EmotiW) challenge, held in 2017. EmotiW aims to provide a common benchmarking platform for researchers working on different aspects of affective computing. This year there are two sub-challenges: a) audio-video emotion recognition and b) group-level emotion recognition. These challenges are based on the Acted Facial Expressions in the Wild and Group Affect databases, respectively. The particular focus of the challenge is to evaluate methods in 'in the wild' settings. 'In the wild' here describes the various environments represented in the images and videos, which depict real-world (not lab-like) scenarios. The baselines, data, and protocols of the two challenges, as well as the challenge participation, are discussed in detail in this paper. @InProceedings{ICMI17p524, author = {Abhinav Dhall and Roland Goecke and Shreya Ghosh and Jyoti Joshi and Jesse Hoey and Tom Gedeon}, title = {From Individual to Group-Level Emotion Recognition: EmotiW 5.0}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {524--528}, doi = {}, year = {2017}, } |
|
Jurie, Frédéric |
ICMI '17-EMOTIW: "Temporal Multimodal Fusion ..."
Temporal Multimodal Fusion for Video Emotion Classification in the Wild
Valentin Vielzeuf, Stéphane Pateux, and Frédéric Jurie (Orange Labs, France; Normandy University, France; CNRS, France) This paper addresses the question of emotion classification. The task consists in predicting the emotion labels (taken from a set of possible labels) that best describe the emotions contained in short video clips. Building on a standard framework – describing videos by audio and visual features that a supervised classifier uses to infer the labels – this paper investigates several novel directions. First of all, improved face descriptors based on 2D and 3D Convolutional Neural Networks are proposed. Second, the paper explores several fusion methods, temporal and multimodal, including a novel hierarchical method combining features and scores. In addition, we carefully reviewed the different stages of the pipeline and designed a CNN architecture adapted to the task; this is important, as the size of the training set is small compared to the difficulty of the problem, making generalization difficult. The resulting model ranked 4th in the 2017 Emotion Recognition in the Wild challenge with an accuracy of 58.8%. @InProceedings{ICMI17p569, author = {Valentin Vielzeuf and Stéphane Pateux and Frédéric Jurie}, title = {Temporal Multimodal Fusion for Video Emotion Classification in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {569--576}, doi = {}, year = {2017}, } |
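The hierarchical features-plus-scores fusion named in this entry can be sketched as follows: per-modality score heads are learned first, and a final head re-uses both the features and those scores (a toy version under assumed dimensions, not the paper's tuned architecture):

import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, dims=(256, 128), n_classes=7):
        super().__init__()
        self.score_heads = nn.ModuleList([nn.Linear(d, n_classes) for d in dims])
        self.final = nn.Linear(sum(dims) + len(dims) * n_classes, n_classes)

    def forward(self, feats):  # feats: list of (batch, d_i) modality features
        scores = [head(f) for head, f in zip(self.score_heads, feats)]
        fused = self.final(torch.cat(feats + scores, dim=1))
        return fused, scores   # the per-modality scores can get their own supervision

out, per_modality = HierarchicalFusion()([torch.rand(4, 256), torch.rand(4, 128)])
print(out.shape)  # torch.Size([4, 7])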
|
Kawaai, Shigenori |
ICMI '17-EMOTIW: "Audio-Visual Emotion Recognition ..."
Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models
Xi Ouyang, Shigenori Kawaai, Ester Gue Hua Goh, Shengmei Shen, Wan Ding, Huaiping Ming, and Dong-Yan Huang (Panasonic R&D Center, Singapore; Central China Normal University, China; Institute for Infocomm Research at A*STAR, Singapore) This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear, and disgust) and neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenges of in-the-wild emotion recognition. Deep network transfer learning is used for feature extraction. Spatial-temporal model fusion makes full use of the complementarity of different networks. Semi-automatic reinforcement learning optimizes the fusion strategy based on dynamic outside feedback given by the challenge organizers. The overall accuracy of the proposed approach on the challenge test dataset is 57.2%, which is better than the challenge baseline of 40.47%. @InProceedings{ICMI17p577, author = {Xi Ouyang and Shigenori Kawaai and Ester Gue Hua Goh and Shengmei Shen and Wan Ding and Huaiping Ming and Dong-Yan Huang}, title = {Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {577--582}, doi = {}, year = {2017}, } |
|
Kim, Dae Ha |
ICMI '17-EMOTIW: "Multi-modal Emotion Recognition ..."
Multi-modal Emotion Recognition using Semi-supervised Learning and Multiple Neural Networks in the Wild
Dae Ha Kim, Min Kyu Lee, Dong Yoon Choi, and Byung Cheol Song (Inha University, South Korea) Human emotion recognition is a research topic that receives continuous attention in the computer vision and artificial intelligence domains. This paper proposes a method for classifying human emotions through multiple neural networks based on multi-modal signals consisting of image, landmark, and audio in a wild environment. The proposed method has the following features. First, the learning performance of the image-based network is greatly improved by employing both multi-task learning and semi-supervised learning using the spatio-temporal characteristics of videos. Second, a new model is proposed for converting one-dimensional (1D) facial landmark information into two-dimensional (2D) images, and a CNN-LSTM network based on this model is proposed for better emotion recognition. Third, based on the observation that audio signals are often very effective for specific emotions, we propose an audio deep learning mechanism that is robust for those emotions. Finally, so-called emotion-adaptive fusion is applied to enable synergy among the multiple networks. On the fifth attempt on the given test set in the EmotiW 2017 challenge, the proposed method achieved a classification accuracy of 57.12%. @InProceedings{ICMI17p529, author = {Dae Ha Kim and Min Kyu Lee and Dong Yoon Choi and Byung Cheol Song}, title = {Multi-modal Emotion Recognition using Semi-supervised Learning and Multiple Neural Networks in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {529--535}, doi = {}, year = {2017}, } |
|
Lee, Min Kyu |
ICMI '17-EMOTIW: "Multi-modal Emotion Recognition ..."
Multi-modal Emotion Recognition using Semi-supervised Learning and Multiple Neural Networks in the Wild
Dae Ha Kim, Min Kyu Lee, Dong Yoon Choi, and Byung Cheol Song (Inha University, South Korea) Human emotion recognition is a research topic that receives continuous attention in the computer vision and artificial intelligence domains. This paper proposes a method for classifying human emotions through multiple neural networks based on multi-modal signals consisting of image, landmark, and audio in a wild environment. The proposed method has the following features. First, the learning performance of the image-based network is greatly improved by employing both multi-task learning and semi-supervised learning using the spatio-temporal characteristics of videos. Second, a new model is proposed for converting one-dimensional (1D) facial landmark information into two-dimensional (2D) images, and a CNN-LSTM network based on this model is proposed for better emotion recognition. Third, based on the observation that audio signals are often very effective for specific emotions, we propose an audio deep learning mechanism that is robust for those emotions. Finally, so-called emotion-adaptive fusion is applied to enable synergy among the multiple networks. On the fifth attempt on the given test set in the EmotiW 2017 challenge, the proposed method achieved a classification accuracy of 57.12%. @InProceedings{ICMI17p529, author = {Dae Ha Kim and Min Kyu Lee and Dong Yoon Choi and Byung Cheol Song}, title = {Multi-modal Emotion Recognition using Semi-supervised Learning and Multiple Neural Networks in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {529--535}, doi = {}, year = {2017}, } |
|
Li, Liandong |
ICMI '17-EMOTIW: "A New Deep-Learning Framework ..."
A New Deep-Learning Framework for Group Emotion Recognition
Qinglan Wei, Yijia Zhao, Qihua Xu, Liandong Li, Jun He, Lejun Yu, and Bo Sun (Beijing Normal University, China) In this paper, we target the group-level emotion recognition sub-challenge of the fifth Emotion Recognition in the Wild (EmotiW 2017) Challenge, which is based on the Group Affect Database 2.0 containing images of groups of people in a wide variety of social events. We use SeetaFace to detect and align the faces in the group images and extract two kinds of face-image visual features: VGGFace-LSTM and DCNN-LSTM. As group image features, we propose using Pyramid Histogram of Oriented Gradients (PHOG), CENTRIST, DCNN, and VGG features. For the testing group images in which faces have been detected, the final emotion is estimated using both group image features and face-level visual features, while for those in which no faces can be detected, only the group image features are fused for final recognition. Our final result is an accuracy of 79.78% on the Group Affect Database 2.0 testing set, which is much higher than the corresponding baseline result of 53.62%. @InProceedings{ICMI17p587, author = {Qinglan Wei and Yijia Zhao and Qihua Xu and Liandong Li and Jun He and Lejun Yu and Bo Sun}, title = {A New Deep-Learning Framework for Group Emotion Recognition}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {587--592}, doi = {}, year = {2017}, } |
|
Ming, Huaiping |
ICMI '17-EMOTIW: "Audio-Visual Emotion Recognition ..."
Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models
Xi Ouyang, Shigenori Kawaai, Ester Gue Hua Goh, Shengmei Shen, Wan Ding, Huaiping Ming, and Dong-Yan Huang (Panasonic R&D Center, Singapore; Central China Normal University, China; Institute for Infocomm Research at A*STAR, Singapore) This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear, and disgust) and neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenges of in-the-wild emotion recognition. Deep network transfer learning is used for feature extraction. Spatial-temporal model fusion makes full use of the complementarity of different networks. Semi-automatic reinforcement learning optimizes the fusion strategy based on dynamic outside feedback given by the challenge organizers. The overall accuracy of the proposed approach on the challenge test dataset is 57.2%, which is better than the challenge baseline of 40.47%. @InProceedings{ICMI17p577, author = {Xi Ouyang and Shigenori Kawaai and Ester Gue Hua Goh and Shengmei Shen and Wan Ding and Huaiping Ming and Dong-Yan Huang}, title = {Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {577--582}, doi = {}, year = {2017}, } |
|
Oruganti, V. Ramana Murthy |
ICMI '17-EMOTIW: "Multi-Level Feature Fusion ..."
Multi-Level Feature Fusion for Group-Level Emotion Recognition
B. Balaji and V. Ramana Murthy Oruganti (Amrita University at Coimbatore, India) In this paper, the influence of low-level and mid-level features is investigated for image-based group emotion recognition. We hypothesize that human faces and the objects surrounding them are major sources of information and can thus serve as mid-level features. Hence, we detect faces and objects using pre-trained Deep Net models. Information from different layers, in conjunction with different encoding techniques, is extensively investigated to obtain the richest feature vectors. The best result, a classification accuracy of 65.0% on the validation set, was submitted to the Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition sub-challenge. The best feature vector yielded 75.1% on the testing set. Post-competition, a few more experiments were performed and are included here as well. @InProceedings{ICMI17p583, author = {B. Balaji and V. Ramana Murthy Oruganti}, title = {Multi-Level Feature Fusion for Group-Level Emotion Recognition}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {583--586}, doi = {}, year = {2017}, } |
|
Ouyang, Xi |
ICMI '17-EMOTIW: "Audio-Visual Emotion Recognition ..."
Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models
Xi Ouyang, Shigenori Kawaai, Ester Gue Hua Goh, Shengmei Shen, Wan Ding, Huaiping Ming, and Dong-Yan Huang (Panasonic R&D Center, Singapore; Central China Normal University, China; Institute for Infocomm Research at A*STAR, Singapore) This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear, and disgust) plus neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenges of emotion recognition in the wild. Deep network transfer learning is used for feature extraction. Spatial-temporal model fusion makes full use of the complementarity of different networks. Semi-automatic reinforcement learning optimizes the fusion strategy based on dynamic external feedback given by the challenge organizers. The overall accuracy of the proposed approach on the challenge test dataset is 57.2%, which is better than the challenge baseline of 40.47%. @InProceedings{ICMI17p577, author = {Xi Ouyang and Shigenori Kawaai and Ester Gue Hua Goh and Shengmei Shen and Wan Ding and Huaiping Ming and Dong-Yan Huang}, title = {Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {577--582}, doi = {}, year = {2017}, } |
|
Patacchiola, Massimiliano |
ICMI '17-EMOTIW: "Emotion Recognition in the ..."
Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers
Luca Surace, Massimiliano Patacchiola, Elena Battini Sönmez, William Spataro, and Angelo Cangelosi (University of Calabria, Italy; Plymouth University, UK; Istanbul Bilgi University, Turkey) Group emotion recognition in the wild is a challenging problem, due to the unstructured environments in which everyday life pictures are taken. Among the obstacles to effective classification are occlusions, variable lighting conditions, and uneven image quality. In this work we present a solution based on a novel combination of deep neural networks and Bayesian classifiers. The neural network works bottom-up, analyzing the emotions expressed by isolated faces. The Bayesian classifier estimates a global emotion by integrating top-down features obtained through a scene descriptor. To validate the system, we tested the framework on the dataset released for the Emotion Recognition in the Wild Challenge 2017. Our method achieved an accuracy of 64.68% on the test set, significantly outperforming the 53.62% competition baseline. @InProceedings{ICMI17p593, author = {Luca Surace and Massimiliano Patacchiola and Elena Battini Sönmez and William Spataro and Angelo Cangelosi}, title = {Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {593--597}, doi = {}, year = {2017}, } |
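One plausible reading of the bottom-up/top-down combination described here is a Bayesian product: the scene descriptor supplies a prior over group emotions and each detected face contributes a likelihood. The sketch below is our own interpretation under that assumption, not the authors' published formulation.

```python
# Hedged sketch: posterior over group emotions as a normalized product of a
# scene-derived prior and per-face CNN likelihoods (computed in log space
# for numerical stability). All inputs here are made up for illustration.
import numpy as np

def bayes_group_emotion(scene_prior, face_likelihoods):
    """scene_prior: (3,); face_likelihoods: (n_faces, 3) per-face outputs."""
    log_post = np.log(scene_prior)
    for face in face_likelihoods:
        log_post += np.log(face)
    post = np.exp(log_post - log_post.max())  # stabilized normalization
    return post / post.sum()

prior = np.array([0.5, 0.3, 0.2])             # top-down scene cue
faces = np.array([[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]])
print(bayes_group_emotion(prior, faces))
```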
|
Pateux, Stéphane |
ICMI '17-EMOTIW: "Temporal Multimodal Fusion ..."
Temporal Multimodal Fusion for Video Emotion Classification in the Wild
Valentin Vielzeuf, Stéphane Pateux, and Frédéric Jurie (Orange Labs, France; Normandy University, France; CNRS, France) This paper addresses the question of emotion classification. The task consists in predicting the emotion labels (taken from a set of possible labels) that best describe the emotions contained in short video clips. Building on a standard framework, in which videos are described by audio and visual features that a supervised classifier uses to infer the labels, this paper investigates several novel directions. First, improved face descriptors based on 2D and 3D Convolutional Neural Networks are proposed. Second, the paper explores several fusion methods, temporal and multimodal, including a novel hierarchical method combining features and scores. In addition, we carefully reviewed the different stages of the pipeline and designed a CNN architecture adapted to the task; this is important because the training set is small relative to the difficulty of the problem, making generalization difficult. The resulting model ranked 4th at the 2017 Emotion Recognition in the Wild challenge with an accuracy of 58.8%. @InProceedings{ICMI17p569, author = {Valentin Vielzeuf and Stéphane Pateux and Frédéric Jurie}, title = {Temporal Multimodal Fusion for Video Emotion Classification in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {569--576}, doi = {}, year = {2017}, } |
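A hierarchical fusion that mixes features and scores, as this entry mentions, can be approximated by stacking: first-level per-modality classifiers emit scores, and a second-level model sees both those scores and the raw features. The toy version below uses invented dimensions and synthetic data purely to show the structure.

```python
# Toy two-level fusion of features and scores (synthetic data, made-up sizes).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, n_classes = 200, 7
audio = rng.normal(size=(n, 32))
visual = rng.normal(size=(n, 64))
y = rng.integers(0, n_classes, size=n)

# Level 1: one classifier per modality, each producing class scores.
audio_clf = LogisticRegression(max_iter=1000).fit(audio, y)
visual_clf = LogisticRegression(max_iter=1000).fit(visual, y)
scores = np.hstack([audio_clf.predict_proba(audio),
                    visual_clf.predict_proba(visual)])

# Level 2: fuse the first-level scores together with the original features.
fused = np.hstack([scores, audio, visual])
meta_clf = LogisticRegression(max_iter=1000).fit(fused, y)
print(meta_clf.score(fused, y))
```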
|
Peng, Xiaojiang |
ICMI '17-EMOTIW: "Group Emotion Recognition ..."
Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs
Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao (SIAT at Chinese Academy of Sciences, China; National Taiwan University, Taiwan) This paper presents our approach for group-level emotion recognition in the Emotion Recognition in the Wild Challenge 2017. The task is to classify an image into one of three group emotions: positive, neutral, or negative. Our approach is based on two types of Convolutional Neural Networks (CNNs), namely individual facial emotion CNNs and global image-based CNNs. For the individual facial emotion CNNs, we first extract all the faces in an image and assign the image label to all of them for training. In particular, we utilize a large-margin softmax loss for discriminative learning and train two CNNs on aligned and non-aligned faces. For the global image-based CNNs, we compare several recent state-of-the-art network structures and data augmentation strategies to boost performance. For a test image, we average the scores from all faces and the whole image to predict the final group emotion category. We won the challenge with accuracies of 83.9% and 80.9% on the validation and testing sets respectively, improving the baseline results by about 30%. @InProceedings{ICMI17p549, author = {Lianzhi Tan and Kaipeng Zhang and Kai Wang and Xiaoxing Zeng and Xiaojiang Peng and Yu Qiao}, title = {Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {549--552}, doi = {}, year = {2017}, } |
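The training-set construction stated in this abstract (every face cropped from a group image inherits that image's label) is simple to make concrete. In this minimal sketch the face detector is abstracted behind a placeholder function; all names are ours.

```python
# Minimal sketch: propagate each group image's label to all detected faces.
from dataclasses import dataclass

@dataclass
class FaceSample:
    crop_id: str
    label: str  # "positive" | "neutral" | "negative"

def detect_faces(image_id):
    # Placeholder: a real pipeline would run a face detector here.
    return [f"{image_id}_face{k}" for k in range(2)]

def build_face_training_set(image_labels):
    """image_labels: dict mapping image_id -> group-level emotion label."""
    samples = []
    for image_id, label in image_labels.items():
        for crop in detect_faces(image_id):
            samples.append(FaceSample(crop, label))
    return samples

print(build_face_training_set({"img001": "positive", "img002": "negative"}))
```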
|
Pini, Stefano |
ICMI '17-EMOTIW: "Modeling Multimodal Cues in ..."
Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild
Stefano Pini, Olfa Ben Ahmed, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, and Benoit Huet (University of Modena and Reggio Emilia, Italy; EURECOM, France) In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third consists of a pretrained audio network used to extract useful deep acoustic signals from video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks to capture the temporal evolution of the audio features. To identify and exploit possible relationships among the different modalities, we propose a fusion network that merges their cues into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve accuracies of 50.39% and 49.92% on the validation and testing data, respectively. @InProceedings{ICMI17p536, author = {Stefano Pini and Olfa Ben Ahmed and Marcella Cornia and Lorenzo Baraldi and Rita Cucchiara and Benoit Huet}, title = {Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {536--543}, doi = {}, year = {2017}, } |
|
Polanía, Luisa F. |
ICMI '17-EMOTIW: "Group-Level Emotion Recognition ..."
Group-Level Emotion Recognition using Deep Models on Image Scene, Faces, and Skeletons
Xin Guo, Luisa F. Polanía, and Kenneth E. Barner (University of Delaware, USA; American Family Mutual Insurance Company, USA) This paper presents the work submitted to the Group-level Emotion Recognition sub-challenge, part of the 5th Emotion Recognition in the Wild (EmotiW 2017) Challenge. The task of this sub-challenge is to classify the emotion of a group of people in each image as positive, neutral, or negative. To address this task, a hybrid network that incorporates global scene features, skeleton features of the group, and local facial features is developed. Specifically, deep convolutional neural networks (CNNs) are first trained on the faces of the group, the whole images, and the skeletons of the group, and then fused to perform the group-level emotion prediction. Experimental results show that the proposed network achieves 80.05% and 80.61% on the validation and testing sets, respectively, outperforming the baselines of 52.97% and 53.62%. @InProceedings{ICMI17p603, author = {Xin Guo and Luisa F. Polanía and Kenneth E. Barner}, title = {Group-Level Emotion Recognition using Deep Models on Image Scene, Faces, and Skeletons}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {603--608}, doi = {}, year = {2017}, } |
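A late-fusion head over the three branches this entry names (scene, skeleton, faces) could look like the PyTorch module below. This is our sketch with hypothetical feature sizes, not the authors' architecture; each input stands in for the output of one pretrained CNN branch.

```python
# Illustrative fusion head over scene, skeleton, and face features.
import torch
import torch.nn as nn

class HybridFusionHead(nn.Module):
    def __init__(self, scene_dim=2048, skeleton_dim=512, face_dim=512,
                 n_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(scene_dim + skeleton_dim + face_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, scene, skeleton, face):
        # Each argument is a batch of features from one CNN branch.
        return self.fuse(torch.cat([scene, skeleton, face], dim=1))

head = HybridFusionHead()
logits = head(torch.randn(4, 2048), torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 3])
```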
|
Qiao, Yu |
ICMI '17-EMOTIW: "Group Emotion Recognition ..."
Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs
Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao (SIAT at Chinese Academy of Sciences, China; National Taiwan University, Taiwan) This paper presents our approach for group-level emotion recognition in the Emotion Recognition in the Wild Challenge 2017. The task is to classify an image into one of three group emotions: positive, neutral, or negative. Our approach is based on two types of Convolutional Neural Networks (CNNs), namely individual facial emotion CNNs and global image-based CNNs. For the individual facial emotion CNNs, we first extract all the faces in an image and assign the image label to all of them for training. In particular, we utilize a large-margin softmax loss for discriminative learning and train two CNNs on aligned and non-aligned faces. For the global image-based CNNs, we compare several recent state-of-the-art network structures and data augmentation strategies to boost performance. For a test image, we average the scores from all faces and the whole image to predict the final group emotion category. We won the challenge with accuracies of 83.9% and 80.9% on the validation and testing sets respectively, improving the baseline results by about 30%. @InProceedings{ICMI17p549, author = {Lianzhi Tan and Kaipeng Zhang and Kai Wang and Xiaoxing Zeng and Xiaojiang Peng and Yu Qiao}, title = {Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {549--552}, doi = {}, year = {2017}, } |
|
Qin, Yong |
ICMI '17-EMOTIW: "Emotion Recognition with Multimodal ..."
Emotion Recognition with Multimodal Features and Temporal Models
Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin, Shilei Zhang, and Yong Qin (Renmin University of China, China; IBM Research Lab, China) This paper presents our methods for the Audio-Video Based Emotion Recognition subtask in the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task aims to predict one of the seven basic emotions for short video segments. We extract different features from the audio and facial expression modalities. We also explore a temporal LSTM model with frame-level facial features as input, which improves on the performance of the non-temporal model. The fusion of the different modality features and the temporal model leads us to a 58.5% accuracy on the testing set, which shows the effectiveness of our methods. @InProceedings{ICMI17p598, author = {Shuai Wang and Wenxuan Wang and Jinming Zhao and Shizhe Chen and Qin Jin and Shilei Zhang and Yong Qin}, title = {Emotion Recognition with Multimodal Features and Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {598--602}, doi = {}, year = {2017}, } |
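The temporal model this abstract names, an LSTM over per-frame facial features, has a standard shape. The sketch below shows that shape with assumed dimensions (feature size, hidden size); it is an illustration, not the authors' exact configuration.

```python
# Sketch: LSTM over per-frame facial features; the last hidden state is
# classified into one of the seven emotions. Dimensions are assumptions.
import torch
import torch.nn as nn

class FrameLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=512, hidden=128, n_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, frames):           # frames: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(frames)  # h_n: (1, batch, hidden)
        return self.head(h_n[-1])        # logits: (batch, n_classes)

model = FrameLSTMClassifier()
print(model(torch.randn(2, 30, 512)).shape)  # torch.Size([2, 7])
```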
|
Rassadin, Alexandr |
ICMI '17-EMOTIW: "Group-Level Emotion Recognition ..."
Group-Level Emotion Recognition using Transfer Learning from Face Identification
Alexandr Rassadin, Alexey Gruzdev, and Andrey Savchenko (National Research University Higher School of Economics, Russia) In this paper, we describe the algorithmic approach used for our submissions to the fifth Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition sub-challenge. We extracted feature vectors of detected faces using a Convolutional Neural Network trained for the face identification task, rather than the traditional pre-training on emotion recognition problems. In the final pipeline, an ensemble of Random Forest classifiers was trained on the available training set to predict the emotion score. When no faces are detected, one member of our ensemble extracts features from the whole image instead. In our experimental study, the proposed approach showed the lowest error rate among the techniques explored. In particular, we achieved 75.4% accuracy on the validation data, which is 20% higher than the handcrafted-feature baseline. The source code, based on the Keras framework, is publicly available. @InProceedings{ICMI17p544, author = {Alexandr Rassadin and Alexey Gruzdev and Andrey Savchenko}, title = {Group-Level Emotion Recognition using Transfer Learning from Face Identification}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {544--548}, doi = {}, year = {2017}, } |
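The pipeline described here (fixed embeddings from a face-identification network, then an ensemble of Random Forests) compresses to a few lines with scikit-learn. The embeddings and data below are synthetic stand-ins; only the structure mirrors the abstract.

```python
# Compact approximation: Random Forest ensemble over face-ID style embeddings,
# averaging class probabilities across ensemble members. Data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X_train = rng.normal(size=(300, 128))   # stand-in face-identification features
y_train = rng.integers(0, 3, size=300)  # positive / neutral / negative

forests = [RandomForestClassifier(n_estimators=100, random_state=s)
           .fit(X_train, y_train) for s in range(5)]

def ensemble_predict(x):
    probs = np.mean([f.predict_proba(x) for f in forests], axis=0)
    return probs.argmax(axis=1)

print(ensemble_predict(rng.normal(size=(3, 128))))
```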
|
Savchenko, Andrey |
ICMI '17-EMOTIW: "Group-Level Emotion Recognition ..."
Group-Level Emotion Recognition using Transfer Learning from Face Identification
Alexandr Rassadin, Alexey Gruzdev, and Andrey Savchenko (National Research University Higher School of Economics, Russia) In this paper, we describe the algorithmic approach used for our submissions to the fifth Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition sub-challenge. We extracted feature vectors of detected faces using a Convolutional Neural Network trained for the face identification task, rather than the traditional pre-training on emotion recognition problems. In the final pipeline, an ensemble of Random Forest classifiers was trained on the available training set to predict the emotion score. When no faces are detected, one member of our ensemble extracts features from the whole image instead. In our experimental study, the proposed approach showed the lowest error rate among the techniques explored. In particular, we achieved 75.4% accuracy on the validation data, which is 20% higher than the handcrafted-feature baseline. The source code, based on the Keras framework, is publicly available. @InProceedings{ICMI17p544, author = {Alexandr Rassadin and Alexey Gruzdev and Andrey Savchenko}, title = {Group-Level Emotion Recognition using Transfer Learning from Face Identification}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {544--548}, doi = {}, year = {2017}, } |
|
Shen, Shengmei |
ICMI '17-EMOTIW: "Audio-Visual Emotion Recognition ..."
Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models
Xi Ouyang, Shigenori Kawaai, Ester Gue Hua Goh, Shengmei Shen, Wan Ding, Huaiping Ming, and Dong-Yan Huang (Panasonic R&D Center, Singapore; Central China Normal University, China; Institute for Infocomm Research at A*STAR, Singapore) This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear, and disgust) plus neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenges of emotion recognition in the wild. Deep network transfer learning is used for feature extraction. Spatial-temporal model fusion makes full use of the complementarity of different networks. Semi-automatic reinforcement learning optimizes the fusion strategy based on dynamic external feedback given by the challenge organizers. The overall accuracy of the proposed approach on the challenge test dataset is 57.2%, which is better than the challenge baseline of 40.47%. @InProceedings{ICMI17p577, author = {Xi Ouyang and Shigenori Kawaai and Ester Gue Hua Goh and Shengmei Shen and Wan Ding and Huaiping Ming and Dong-Yan Huang}, title = {Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {577--582}, doi = {}, year = {2017}, } |
|
Song, Byung Cheol |
ICMI '17-EMOTIW: "Multi-modal Emotion Recognition ..."
Multi-modal Emotion Recognition using Semi-supervised Learning and Multiple Neural Networks in the Wild
Dae Ha Kim, Min Kyu Lee, Dong Yoon Choi, and Byung Cheol Song (Inha University, South Korea) Human emotion recognition is a research topic that receives continuous attention in the computer vision and artificial intelligence domains. This paper proposes a method for classifying human emotions in the wild through multiple neural networks based on multi-modal signals consisting of images, landmarks, and audio. The proposed method has the following features. First, the learning performance of the image-based network is greatly improved by employing both multi-task learning and semi-supervised learning that exploit the spatio-temporal characteristics of videos. Second, a model that converts one-dimensional (1D) facial landmark information into two-dimensional (2D) images is newly proposed, and a CNN-LSTM network based on this model is proposed for better emotion recognition. Third, based on the observation that audio signals are often very effective for specific emotions, we propose an audio deep learning mechanism that is robust to those specific emotions. Finally, so-called emotion-adaptive fusion is applied to enable synergy among the multiple networks. On the fifth attempt on the given test set in the EmotiW 2017 challenge, the proposed method achieved a classification accuracy of 57.12%. @InProceedings{ICMI17p529, author = {Dae Ha Kim and Min Kyu Lee and Dong Yoon Choi and Byung Cheol Song}, title = {Multi-modal Emotion Recognition using Semi-supervised Learning and Multiple Neural Networks in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {529--535}, doi = {}, year = {2017}, } |
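One simple way to realize the 1D-to-2D landmark conversion idea this entry mentions is to rasterize the (x, y) landmark coordinates onto a small grid so a CNN can consume them. The grid size and normalization below are our assumptions, not the paper's method.

```python
# Hedged sketch: rasterize facial landmarks onto a 2D grid for CNN input.
import numpy as np

def landmarks_to_image(landmarks, size=64):
    """landmarks: (n_points, 2) array of (x, y) in [0, 1]; returns (size, size)."""
    img = np.zeros((size, size), dtype=np.float32)
    pts = np.clip((np.asarray(landmarks) * (size - 1)).astype(int), 0, size - 1)
    for x, y in pts:
        img[y, x] = 1.0
    return img

# 68 random points, the count produced by a typical facial landmark detector.
lm = np.random.default_rng(0).random((68, 2))
print(landmarks_to_image(lm).sum())  # number of lit cells (collisions aside)
```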
|
Spataro, William |
ICMI '17-EMOTIW: "Emotion Recognition in the ..."
Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers
Luca Surace, Massimiliano Patacchiola, Elena Battini Sönmez, William Spataro, and Angelo Cangelosi (University of Calabria, Italy; Plymouth University, UK; Istanbul Bilgi University, Turkey) Group emotion recognition in the wild is a challenging problem, due to the unstructured environments in which everyday life pictures are taken. Among the obstacles to effective classification are occlusions, variable lighting conditions, and uneven image quality. In this work we present a solution based on a novel combination of deep neural networks and Bayesian classifiers. The neural network works bottom-up, analyzing the emotions expressed by isolated faces. The Bayesian classifier estimates a global emotion by integrating top-down features obtained through a scene descriptor. To validate the system, we tested the framework on the dataset released for the Emotion Recognition in the Wild Challenge 2017. Our method achieved an accuracy of 64.68% on the test set, significantly outperforming the 53.62% competition baseline. @InProceedings{ICMI17p593, author = {Luca Surace and Massimiliano Patacchiola and Elena Battini Sönmez and William Spataro and Angelo Cangelosi}, title = {Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {593--597}, doi = {}, year = {2017}, } |
|
Sun, Bo |
ICMI '17-EMOTIW: "A New Deep-Learning Framework ..."
A New Deep-Learning Framework for Group Emotion Recognition
Qinglan Wei, Yijia Zhao, Qihua Xu, Liandong Li, Jun He, Lejun Yu, and Bo Sun (Beijing Normal University, China) In this paper, we target the group-level emotion recognition sub-challenge of the fifth Emotion Recognition in the Wild (EmotiW 2017) Challenge, which is based on the Group Affect Database 2.0, containing images of groups of people at a wide variety of social events. We use Seetaface to detect and align the faces in the group images and extract two kinds of face-level visual features: VGGFace-LSTM and DCNN-LSTM. As group-image features, we propose using Pyramid Histogram of Oriented Gradients (PHOG), CENTRIST, DCNN, and VGG features. For test images in which faces are detected, the final emotion is estimated from both the group-image features and the face-level visual features; for test images in which no faces can be detected, only the group-image features are fused for the final recognition. Our final result is an accuracy of 79.78% on the Group Affect Database 2.0 testing set, much higher than the corresponding baseline of 53.62%. @InProceedings{ICMI17p587, author = {Qinglan Wei and Yijia Zhao and Qihua Xu and Liandong Li and Jun He and Lejun Yu and Bo Sun}, title = {A New Deep-Learning Framework for Group Emotion Recognition}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {587--592}, doi = {}, year = {2017}, } |
|
Surace, Luca |
ICMI '17-EMOTIW: "Emotion Recognition in the ..."
Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers
Luca Surace, Massimiliano Patacchiola, Elena Battini Sönmez, William Spataro, and Angelo Cangelosi (University of Calabria, Italy; Plymouth University, UK; Istanbul Bilgi University, Turkey) Group emotion recognition in the wild is a challenging problem, due to the unstructured environments in which everyday life pictures are taken. Among the obstacles to effective classification are occlusions, variable lighting conditions, and uneven image quality. In this work we present a solution based on a novel combination of deep neural networks and Bayesian classifiers. The neural network works bottom-up, analyzing the emotions expressed by isolated faces. The Bayesian classifier estimates a global emotion by integrating top-down features obtained through a scene descriptor. To validate the system, we tested the framework on the dataset released for the Emotion Recognition in the Wild Challenge 2017. Our method achieved an accuracy of 64.68% on the test set, significantly outperforming the 53.62% competition baseline. @InProceedings{ICMI17p593, author = {Luca Surace and Massimiliano Patacchiola and Elena Battini Sönmez and William Spataro and Angelo Cangelosi}, title = {Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {593--597}, doi = {}, year = {2017}, } |
|
Tan, Lianzhi |
ICMI '17-EMOTIW: "Group Emotion Recognition ..."
Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs
Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao (SIAT at Chinese Academy of Sciences, China; National Taiwan University, Taiwan) This paper presents our approach for group-level emotion recognition in the Emotion Recognition in the Wild Challenge 2017. The task is to classify an image into one of three group emotions: positive, neutral, or negative. Our approach is based on two types of Convolutional Neural Networks (CNNs), namely individual facial emotion CNNs and global image-based CNNs. For the individual facial emotion CNNs, we first extract all the faces in an image and assign the image label to all of them for training. In particular, we utilize a large-margin softmax loss for discriminative learning and train two CNNs on aligned and non-aligned faces. For the global image-based CNNs, we compare several recent state-of-the-art network structures and data augmentation strategies to boost performance. For a test image, we average the scores from all faces and the whole image to predict the final group emotion category. We won the challenge with accuracies of 83.9% and 80.9% on the validation and testing sets respectively, improving the baseline results by about 30%. @InProceedings{ICMI17p549, author = {Lianzhi Tan and Kaipeng Zhang and Kai Wang and Xiaoxing Zeng and Xiaojiang Peng and Yu Qiao}, title = {Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {549--552}, doi = {}, year = {2017}, } |
|
Vielzeuf, Valentin |
ICMI '17-EMOTIW: "Temporal Multimodal Fusion ..."
Temporal Multimodal Fusion for Video Emotion Classification in the Wild
Valentin Vielzeuf, Stéphane Pateux, and Frédéric Jurie (Orange Labs, France; Normandy University, France; CNRS, France) This paper addresses the question of emotion classification. The task consists in predicting the emotion labels (taken from a set of possible labels) that best describe the emotions contained in short video clips. Building on a standard framework, in which videos are described by audio and visual features that a supervised classifier uses to infer the labels, this paper investigates several novel directions. First, improved face descriptors based on 2D and 3D Convolutional Neural Networks are proposed. Second, the paper explores several fusion methods, temporal and multimodal, including a novel hierarchical method combining features and scores. In addition, we carefully reviewed the different stages of the pipeline and designed a CNN architecture adapted to the task; this is important because the training set is small relative to the difficulty of the problem, making generalization difficult. The resulting model ranked 4th at the 2017 Emotion Recognition in the Wild challenge with an accuracy of 58.8%. @InProceedings{ICMI17p569, author = {Valentin Vielzeuf and Stéphane Pateux and Frédéric Jurie}, title = {Temporal Multimodal Fusion for Video Emotion Classification in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {569--576}, doi = {}, year = {2017}, } |
|
Wang, Kai |
ICMI '17-EMOTIW: "Group Emotion Recognition ..."
Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs
Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao (SIAT at Chinese Academy of Sciences, China; National Taiwan University, Taiwan) This paper presents our approach for group-level emotion recognition in the Emotion Recognition in the Wild Challenge 2017. The task is to classify an image into one of three group emotions: positive, neutral, or negative. Our approach is based on two types of Convolutional Neural Networks (CNNs), namely individual facial emotion CNNs and global image-based CNNs. For the individual facial emotion CNNs, we first extract all the faces in an image and assign the image label to all of them for training. In particular, we utilize a large-margin softmax loss for discriminative learning and train two CNNs on aligned and non-aligned faces. For the global image-based CNNs, we compare several recent state-of-the-art network structures and data augmentation strategies to boost performance. For a test image, we average the scores from all faces and the whole image to predict the final group emotion category. We won the challenge with accuracies of 83.9% and 80.9% on the validation and testing sets respectively, improving the baseline results by about 30%. @InProceedings{ICMI17p549, author = {Lianzhi Tan and Kaipeng Zhang and Kai Wang and Xiaoxing Zeng and Xiaojiang Peng and Yu Qiao}, title = {Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {549--552}, doi = {}, year = {2017}, } |
|
Wang, Shandong |
ICMI '17-EMOTIW: "Learning Supervised Scoring ..."
Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild
Ping Hu, Dongqi Cai, Shandong Wang, Anbang Yao, and Yurong Chen (Intel Labs, China) State-of-the-art approaches in the previous emotion recognition in the wild challenges are usually built on prevailing Convolutional Neural Networks (CNNs). Although there is clear evidence that CNNs with increased depth or width can usually bring improved prediction accuracy, existing top approaches provide supervision only at the output feature layer, resulting in insufficient training of deep CNN models. In this paper, we present a new learning method named Supervised Scoring Ensemble (SSE) for advancing this challenge with deep CNNs. We first extend the idea of recent deep supervision to the emotion recognition problem. By adding supervision not only to deep layers but also to intermediate and shallow layers, the training of deep CNNs is considerably eased. Second, we present a new fusion structure in which class-wise scoring activations at diverse complementary feature layers are concatenated and used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. We show that our proposed learning method consistently brings large accuracy gains across diverse backbone networks. On this year's audio-video based emotion recognition task, the average recognition rate of our best submission is 60.34%, setting a new record over all existing results. @InProceedings{ICMI17p553, author = {Ping Hu and Dongqi Cai and Shandong Wang and Anbang Yao and Yurong Chen}, title = {Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {553--560}, doi = {}, year = {2017}, } |
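Our reading of the SSE idea, sketched below with made-up layer sizes: auxiliary classifiers supervise shallow and deep features, and their class-wise scores are concatenated for a second-level supervised head. This is an interpretive toy model, not the paper's network.

```python
# Toy deeply-supervised network with a second-level score-fusion head.
import torch
import torch.nn as nn

class DeeplySupervisedNet(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(256, 128), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
        self.aux1 = nn.Linear(128, n_classes)   # supervision at a shallow layer
        self.aux2 = nn.Linear(64, n_classes)    # supervision at a deeper layer
        self.fusion = nn.Linear(2 * n_classes, n_classes)  # score ensemble

    def forward(self, x):
        h1 = self.block1(x)
        h2 = self.block2(h1)
        s1, s2 = self.aux1(h1), self.aux2(h2)
        fused = self.fusion(torch.cat([s1, s2], dim=1))
        return s1, s2, fused

net = DeeplySupervisedNet()
x, y = torch.randn(8, 256), torch.randint(0, 7, (8,))
s1, s2, fused = net(x)
# Every head receives its own supervision signal; the losses are summed.
loss = sum(nn.functional.cross_entropy(s, y) for s in (s1, s2, fused))
loss.backward()
print(float(loss))
```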
|
Wang, Shuai |
ICMI '17-EMOTIW: "Emotion Recognition with Multimodal ..."
Emotion Recognition with Multimodal Features and Temporal Models
Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin, Shilei Zhang, and Yong Qin (Renmin University of China, China; IBM Research Lab, China) This paper presents our methods for the Audio-Video Based Emotion Recognition subtask in the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task aims to predict one of the seven basic emotions for short video segments. We extract different features from the audio and facial expression modalities. We also explore a temporal LSTM model with frame-level facial features as input, which improves on the performance of the non-temporal model. The fusion of the different modality features and the temporal model leads us to a 58.5% accuracy on the testing set, which shows the effectiveness of our methods. @InProceedings{ICMI17p598, author = {Shuai Wang and Wenxuan Wang and Jinming Zhao and Shizhe Chen and Qin Jin and Shilei Zhang and Yong Qin}, title = {Emotion Recognition with Multimodal Features and Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {598--602}, doi = {}, year = {2017}, } |
|
Wang, Wenxuan |
ICMI '17-EMOTIW: "Emotion Recognition with Multimodal ..."
Emotion Recognition with Multimodal Features and Temporal Models
Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin, Shilei Zhang, and Yong Qin (Renmin University of China, China; IBM Research Lab, China) This paper presents our methods for the Audio-Video Based Emotion Recognition subtask in the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task aims to predict one of the seven basic emotions for short video segments. We extract different features from the audio and facial expression modalities. We also explore a temporal LSTM model with frame-level facial features as input, which improves on the performance of the non-temporal model. The fusion of the different modality features and the temporal model leads us to a 58.5% accuracy on the testing set, which shows the effectiveness of our methods. @InProceedings{ICMI17p598, author = {Shuai Wang and Wenxuan Wang and Jinming Zhao and Shizhe Chen and Qin Jin and Shilei Zhang and Yong Qin}, title = {Emotion Recognition with Multimodal Features and Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {598--602}, doi = {}, year = {2017}, } |
|
Wei, Qinglan |
ICMI '17-EMOTIW: "A New Deep-Learning Framework ..."
A New Deep-Learning Framework for Group Emotion Recognition
Qinglan Wei, Yijia Zhao, Qihua Xu, Liandong Li, Jun He, Lejun Yu, and Bo Sun (Beijing Normal University, China) In this paper, we target the group-level emotion recognition sub-challenge of the fifth Emotion Recognition in the Wild (EmotiW 2017) Challenge, which is based on the Group Affect Database 2.0, containing images of groups of people at a wide variety of social events. We use Seetaface to detect and align the faces in the group images and extract two kinds of face-level visual features: VGGFace-LSTM and DCNN-LSTM. As group-image features, we propose using Pyramid Histogram of Oriented Gradients (PHOG), CENTRIST, DCNN, and VGG features. For test images in which faces are detected, the final emotion is estimated from both the group-image features and the face-level visual features; for test images in which no faces can be detected, only the group-image features are fused for the final recognition. Our final result is an accuracy of 79.78% on the Group Affect Database 2.0 testing set, much higher than the corresponding baseline of 53.62%. @InProceedings{ICMI17p587, author = {Qinglan Wei and Yijia Zhao and Qihua Xu and Liandong Li and Jun He and Lejun Yu and Bo Sun}, title = {A New Deep-Learning Framework for Group Emotion Recognition}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {587--592}, doi = {}, year = {2017}, } |
|
Xu, Qihua |
ICMI '17-EMOTIW: "A New Deep-Learning Framework ..."
A New Deep-Learning Framework for Group Emotion Recognition
Qinglan Wei, Yijia Zhao, Qihua Xu, Liandong Li, Jun He, Lejun Yu, and Bo Sun (Beijing Normal University, China) In this paper, we target the group-level emotion recognition sub-challenge of the fifth Emotion Recognition in the Wild (EmotiW 2017) Challenge, which is based on the Group Affect Database 2.0, containing images of groups of people at a wide variety of social events. We use Seetaface to detect and align the faces in the group images and extract two kinds of face-level visual features: VGGFace-LSTM and DCNN-LSTM. As group-image features, we propose using Pyramid Histogram of Oriented Gradients (PHOG), CENTRIST, DCNN, and VGG features. For test images in which faces are detected, the final emotion is estimated from both the group-image features and the face-level visual features; for test images in which no faces can be detected, only the group-image features are fused for the final recognition. Our final result is an accuracy of 79.78% on the Group Affect Database 2.0 testing set, much higher than the corresponding baseline of 53.62%. @InProceedings{ICMI17p587, author = {Qinglan Wei and Yijia Zhao and Qihua Xu and Liandong Li and Jun He and Lejun Yu and Bo Sun}, title = {A New Deep-Learning Framework for Group Emotion Recognition}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {587--592}, doi = {}, year = {2017}, } |
|
Yao, Anbang |
ICMI '17-EMOTIW: "Learning Supervised Scoring ..."
Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild
Ping Hu, Dongqi Cai, Shandong Wang, Anbang Yao, and Yurong Chen (Intel Labs, China) State-of-the-art approaches in the previous emotion recognition in the wild challenges are usually built on prevailing Convolutional Neural Networks (CNNs). Although there is clear evidence that CNNs with increased depth or width can usually bring improved prediction accuracy, existing top approaches provide supervision only at the output feature layer, resulting in insufficient training of deep CNN models. In this paper, we present a new learning method named Supervised Scoring Ensemble (SSE) for advancing this challenge with deep CNNs. We first extend the idea of recent deep supervision to the emotion recognition problem. By adding supervision not only to deep layers but also to intermediate and shallow layers, the training of deep CNNs is considerably eased. Second, we present a new fusion structure in which class-wise scoring activations at diverse complementary feature layers are concatenated and used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. We show that our proposed learning method consistently brings large accuracy gains across diverse backbone networks. On this year's audio-video based emotion recognition task, the average recognition rate of our best submission is 60.34%, setting a new record over all existing results. @InProceedings{ICMI17p553, author = {Ping Hu and Dongqi Cai and Shandong Wang and Anbang Yao and Yurong Chen}, title = {Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {553--560}, doi = {}, year = {2017}, } |
|
Yu, Lejun |
ICMI '17-EMOTIW: "A New Deep-Learning Framework ..."
A New Deep-Learning Framework for Group Emotion Recognition
Qinglan Wei, Yijia Zhao, Qihua Xu, Liandong Li, Jun He, Lejun Yu, and Bo Sun (Beijing Normal University, China) In this paper, we target the group-level emotion recognition sub-challenge of the fifth Emotion Recognition in the Wild (EmotiW 2017) Challenge, which is based on the Group Affect Database 2.0, containing images of groups of people at a wide variety of social events. We use Seetaface to detect and align the faces in the group images and extract two kinds of face-level visual features: VGGFace-LSTM and DCNN-LSTM. As group-image features, we propose using Pyramid Histogram of Oriented Gradients (PHOG), CENTRIST, DCNN, and VGG features. For test images in which faces are detected, the final emotion is estimated from both the group-image features and the face-level visual features; for test images in which no faces can be detected, only the group-image features are fused for the final recognition. Our final result is an accuracy of 79.78% on the Group Affect Database 2.0 testing set, much higher than the corresponding baseline of 53.62%. @InProceedings{ICMI17p587, author = {Qinglan Wei and Yijia Zhao and Qihua Xu and Liandong Li and Jun He and Lejun Yu and Bo Sun}, title = {A New Deep-Learning Framework for Group Emotion Recognition}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {587--592}, doi = {}, year = {2017}, } |
|
Zeng, Xiaoxing |
ICMI '17-EMOTIW: "Group Emotion Recognition ..."
Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs
Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao (SIAT at Chinese Academy of Sciences, China; National Taiwan University, Taiwan) This paper presents our approach for group-level emotion recognition in the Emotion Recognition in the Wild Challenge 2017. The task is to classify an image into one of three group emotions: positive, neutral, or negative. Our approach is based on two types of Convolutional Neural Networks (CNNs), namely individual facial emotion CNNs and global image-based CNNs. For the individual facial emotion CNNs, we first extract all the faces in an image and assign the image label to all of them for training. In particular, we utilize a large-margin softmax loss for discriminative learning and train two CNNs on aligned and non-aligned faces. For the global image-based CNNs, we compare several recent state-of-the-art network structures and data augmentation strategies to boost performance. For a test image, we average the scores from all faces and the whole image to predict the final group emotion category. We won the challenge with accuracies of 83.9% and 80.9% on the validation and testing sets respectively, improving the baseline results by about 30%. @InProceedings{ICMI17p549, author = {Lianzhi Tan and Kaipeng Zhang and Kai Wang and Xiaoxing Zeng and Xiaojiang Peng and Yu Qiao}, title = {Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {549--552}, doi = {}, year = {2017}, } |
|
Zhang, Kaipeng |
ICMI '17-EMOTIW: "Group Emotion Recognition ..."
Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs
Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao (SIAT at Chinese Academy of Sciences, China; National Taiwan University, Taiwan) This paper presents our approach for group-level emotion recognition in the Emotion Recognition in the Wild Challenge 2017. The task is to classify an image into one of three group emotions: positive, neutral, or negative. Our approach is based on two types of Convolutional Neural Networks (CNNs), namely individual facial emotion CNNs and global image-based CNNs. For the individual facial emotion CNNs, we first extract all the faces in an image and assign the image label to all of them for training. In particular, we utilize a large-margin softmax loss for discriminative learning and train two CNNs on aligned and non-aligned faces. For the global image-based CNNs, we compare several recent state-of-the-art network structures and data augmentation strategies to boost performance. For a test image, we average the scores from all faces and the whole image to predict the final group emotion category. We won the challenge with accuracies of 83.9% and 80.9% on the validation and testing sets respectively, improving the baseline results by about 30%. @InProceedings{ICMI17p549, author = {Lianzhi Tan and Kaipeng Zhang and Kai Wang and Xiaoxing Zeng and Xiaojiang Peng and Yu Qiao}, title = {Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {549--552}, doi = {}, year = {2017}, } |
|
Zhang, Shilei |
ICMI '17-EMOTIW: "Emotion Recognition with Multimodal ..."
Emotion Recognition with Multimodal Features and Temporal Models
Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin, Shilei Zhang, and Yong Qin (Renmin University of China, China; IBM Research Lab, China) This paper presents our methods for the Audio-Video Based Emotion Recognition subtask in the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task aims to predict one of the seven basic emotions for short video segments. We extract different features from the audio and facial expression modalities. We also explore a temporal LSTM model with frame-level facial features as input, which improves on the performance of the non-temporal model. The fusion of the different modality features and the temporal model leads us to a 58.5% accuracy on the testing set, which shows the effectiveness of our methods. @InProceedings{ICMI17p598, author = {Shuai Wang and Wenxuan Wang and Jinming Zhao and Shizhe Chen and Qin Jin and Shilei Zhang and Yong Qin}, title = {Emotion Recognition with Multimodal Features and Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {598--602}, doi = {}, year = {2017}, } |
|
Zhao, Jinming |
ICMI '17-EMOTIW: "Emotion Recognition with Multimodal ..."
Emotion Recognition with Multimodal Features and Temporal Models
Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin, Shilei Zhang, and Yong Qin (Renmin University of China, China; IBM Research Lab, China) This paper presents our methods for the Audio-Video Based Emotion Recognition subtask in the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task aims to predict one of the seven basic emotions for short video segments. We extract different features from the audio and facial expression modalities. We also explore a temporal LSTM model with frame-level facial features as input, which improves on the performance of the non-temporal model. The fusion of the different modality features and the temporal model leads us to a 58.5% accuracy on the testing set, which shows the effectiveness of our methods. @InProceedings{ICMI17p598, author = {Shuai Wang and Wenxuan Wang and Jinming Zhao and Shizhe Chen and Qin Jin and Shilei Zhang and Yong Qin}, title = {Emotion Recognition with Multimodal Features and Temporal Models}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {598--602}, doi = {}, year = {2017}, } |
|
Zhao, Yijia |
ICMI '17-EMOTIW: "A New Deep-Learning Framework ..."
A New Deep-Learning Framework for Group Emotion Recognition
Qinglan Wei, Yijia Zhao, Qihua Xu, Liandong Li, Jun He, Lejun Yu, and Bo Sun (Beijing Normal University, China) In this paper, we target the group-level emotion recognition sub-challenge of the fifth Emotion Recognition in the Wild (EmotiW 2017) Challenge, which is based on the Group Affect Database 2.0, containing images of groups of people at a wide variety of social events. We use Seetaface to detect and align the faces in the group images and extract two kinds of face-level visual features: VGGFace-LSTM and DCNN-LSTM. As group-image features, we propose using Pyramid Histogram of Oriented Gradients (PHOG), CENTRIST, DCNN, and VGG features. For test images in which faces are detected, the final emotion is estimated from both the group-image features and the face-level visual features; for test images in which no faces can be detected, only the group-image features are fused for the final recognition. Our final result is an accuracy of 79.78% on the Group Affect Database 2.0 testing set, much higher than the corresponding baseline of 53.62%. @InProceedings{ICMI17p587, author = {Qinglan Wei and Yijia Zhao and Qihua Xu and Liandong Li and Jun He and Lejun Yu and Bo Sun}, title = {A New Deep-Learning Framework for Group Emotion Recognition}, booktitle = {Proc.\ ICMI}, publisher = {ACM}, pages = {587--592}, doi = {}, year = {2017}, } |
66 authors