ICMI 2016
18th ACM International Conference on Multimodal Interaction (ICMI 2016)

18th ACM International Conference on Multimodal Interaction (ICMI 2016), November 12–16, 2016, Tokyo, Japan

ICMI 2016 – Proceedings



Title Page

Message from the Chairs

Welcome to Tokyo and to the 18th ACM International Conference on Multimodal Interaction, ICMI 2016. ICMI is the premier international forum for multidisciplinary research on multimodal human-human and human-computer interaction, interfaces, and system development. The conference focuses on theoretical and empirical foundations, component technologies, and combined multimodal processing techniques that define the field of multimodal interaction analysis, interface design, and system development. This year, ICMI focuses on machine learning for multimodal interaction as a special topic of interest. ICMI 2016 features a single-track main conference which includes keynote speakers, technical full and short papers (including oral and poster presentations), demonstrations, company exhibits, tutorials, grand challenges, and a doctoral consortium. It is followed by a day of workshops.

The ICMI 2016 call for long and short papers attracted a record number of 144 paper submissions (78 in the long category and 66 in the short category). The papers were reviewed by a program committee composed of 25 Senior Program Committee (SPC) members and 197 technical reviewers. Special emphasis was placed this year on increasing the quality of each review. Before the rebuttal process, SPC members took the time to check each review and send follow-up directions to reviewers if needed. During the rebuttal process, the authors had the opportunity to clarify any misunderstandings and respond to questions raised in the reviews and meta-reviews. After the rebuttal phase, the SPC members actively discussed each paper, including the author rebuttal. The whole review process was extended by two weeks compared to previous years, and was closely supervised by the ICMI 2016 Program Chairs. As a result, 24 papers were accepted for oral presentation and 31 papers were accepted for poster presentation. The acceptance rate is 17% for oral presentations and 38% overall, for short and long papers combined.

The ICMI Sustained Accomplishment Award is presented to a scientist who has made innovative and long-lasting contributions to our field. The award acknowledges an individual who has demonstrated vision in shaping the field, with a sustained record of research that has influenced the work of others. This year’s award is presented to Prof. Wolfgang Wahlster (DFKI, Germany). He will give a plenary talk titled “Help me if you can: Towards Multiadaptive Interaction Platforms”.

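This year, the conference will host three invited keynote speakers: James W. Pennebaker (University of Texas at Austin, USA), Richard Zemel (University of Toronto, Canada), and Susumu Tachi (University of Tokyo, Japan).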

The main ICMI conference program includes an exciting Demonstration session co-chaired by Ronald Poppe (Utrecht University, The Netherlands) and Ryo Ishii (NTT, Japan) that will showcase innovative implementations, systems, and technologies that incorporate multimodal interaction. In addition, the Demonstration session this year will feature a Multimodal Resources track to showcase novel corpora, annotation tools and schemes. The Demonstration session received 25 submissions, of which 17 were accepted. The Demonstration session will also include demonstrations that accompany accepted main track papers.

A number of satellite events will be held along with the main conference. A new satellite event introduced at ICMI this year is a tutorial. The lecturer is Dr. Louis-Philippe Morency. He will teach fundamental concepts related to multimodal machine learning on November 12th.

The Doctoral Consortium is by now a traditional ICMI satellite event which takes place on the first day of the conference and extends our commitment to the next generation of researchers. This year, the event is co-chaired by Dirk Heylen (University of Twente, The Netherlands) and Samer Al Moubayed (KTH, Sweden). In this special session, a highly accomplished mentor team and selected senior PhD students gather around a round table to discuss the research plans and progress of each student. From among 16 applications, 14 students were accepted. The accepted students receive a travel grant and registration waiver to attend both the Doctoral Consortium event and the main conference. The organizers thank the U.S. National Science Foundation, the SIGCHI Student Travel Grant (SSTG), and conference sponsors for the financial support that makes this possible.

The Multimodal Grand Challenges were introduced to ICMI in 2012. This year's challenges are co-chaired by Hatice Gunes (University of Cambridge, UK) and Mohammad Soleymani (University of Geneva, Switzerland), and include the Fourth Emotion Recognition in the Wild (EmotiW). The Grand Challenge event will be presented on November 12th, and an overview and poster session will take place during the main conference.

The ICMI workshop program was co-chaired this year by Julien Epps (The University of New South Wales, Australia) and Gabriel Skantze (KTH, Sweden). The workshops this year cover a variety of topics around multimodal interaction, such as embodied agents, VR/AR, social signals, and multi-sensory systems. Seven workshops will be held after the main conference on November 16th. They are: the 1st Workshop on Embodied Interaction with Smart Environments, the 2nd International Workshop on Advancements in Social Signal Processing for Multimodal Interaction (ASSP4MI@ICMI2016), the 2nd Workshop on Emotion Representations and Modelling for Companion Systems (ERM4CT 2016), the Workshop on Multimodal Virtual and Augmented Reality - MVAR 2016, the Workshop on Social learning and multimodal interaction for designing artificial agents, the 1st Workshop on Multi-Sensorial Approaches to Human-Food Interaction, and the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (MA3HMI).

Outstanding paper awards have also been a tradition at ICMI, including both the Outstanding Paper Award and the Outstanding Student Paper Award. The Program Chairs considered the top-ranked paper submissions based on the reviews and meta-reviews and identified a set of nominations for the awards. A paper award committee was formed of internationally renowned researchers in multimodal interaction. The committee reviewed the nominated papers carefully and selected the recipients of these awards, which will be announced at the banquet. ICMI is also continuing the tradition of promoting strong and in-depth reviews by giving Recognition for Excellence in Reviewing awards.

A separate committee was formed to evaluate the nominees for the Sustained Accomplishment Award (previously mentioned in this message). We would like to thank the members of this committee: Julien Epps, Matthew Turk and Jie Yang.

ICMI 2016 Organization
Committee Listings
ICMI 2016 Sponsors and Supporters
Sponsors and Supporters

Invited Talks

Understanding People by Tracking Their Word Use (Keynote)
James W. Pennebaker
(University of Texas at Austin, USA)
The words people use in their conversations, emails, and diaries can tell us how they think, approach problems, connect with others, and behave. Of particular interest are people's use of function words -- pronouns, articles, and other small and forgettable words. Processed in the brain differently from content words, function words reveal where people are paying attention and how they think about themselves and others. After summarizing dozens of studies on language and psychological state, the talk will explore how text analysis can help us get inside the heads of the people we study.
Learning to Generate Images and Their Descriptions (Keynote)
Richard Zemel
(University of Toronto, Canada)
Recent advances in computer vision, natural language processing and related areas have led to a renewed interest in artificial intelligence applications spanning multiple domains. Specifically, the generation of natural human-like captions for images has seen an extraordinary increase in interest. I will describe approaches that combine state-of-the-art computer vision techniques and language models to produce descriptions of visual content with surprisingly high quality. Related methods have also led to significant progress in generating images. The limitations of current approaches and the challenges that lie ahead will both be emphasized.
Embodied Media: Expanding Human Capacity via Virtual Reality and Telexistence (Keynote)
Susumu Tachi
(University of Tokyo, Japan)
The information we acquire in real life gives us a holistic experience that fully incorporates a variety of sensations and bodily motions such as seeing, hearing, speaking, touching, smelling, tasting, and moving. However, the sensory modalities that can be transmitted in our information space are usually limited to visual and auditory ones. Haptic information is rarely used in the information space in our daily lives except in the case of warnings or alerts such as cellphone vibrations. Embodied media such as virtual reality and telexistence should provide holistic sensations, i.e., integrating visual, auditory, haptic, palatal, olfactory, and kinesthetic sensations, such that human users feel that they are present in a computer-generated virtual information space or a remote space having an alternate presence in the environment. Haptics plays an important role in embodied media because it provides both proprioception and cutaneous sensations; it lets users feel like they are touching distant people and objects and also lets them “touch” artificial objects as they see them. In this keynote, an embodied media, which extends human experiences, is overviewed and our research on an embodied media that is both visible and tangible based on our proposed theory of haptic primary colors is introduced. The embodied media would enable telecommunication, tele-experience, and pseudo-experience providing sensations such that the user would feel like working in a natural environment. It would also enable humans to engage in creative activities such as design and creation as though they were in the real environment. We have succeeded in transmitting fine haptic sensations, such as material texture and temperature, from an avatar robot’s fingers to a human user’s fingers. The avatar robot is a telexistence anthropomorphic robot, called TELESAR V, with a body and limbs with 53 degrees of freedom. 
This robot can transmit not only visual and auditory sensations of presence to human users but also realistic haptic sensations. Our other inventions include RePro3D, which is a full-parallax autostereoscopic 3D (three-dimensional) display with haptic feedback using RPT (retroreflective projection technology); TECHTILE Toolkit, which is a prototyping tool for the design and improvement of haptic media; and HaptoMIRAGE, which is a 180°-field-of-view autostereoscopic 3D display using ARIA (active-shuttered real image autostereoscopy) that can be used by three users simultaneously.
Help Me If You Can: Towards Multiadaptive Interaction Platforms (ICMI Awardee Talk)
Wolfgang Wahlster
(DFKI, Germany)
Autonomous Systems like self-driving cars and collaborative robots must occasionally ask people around them for help in anomalous situations. A new generation of multiadaptive interaction platforms provides a comprehensive multimodal presentation of the current situation in real-time, so that a smooth transfer of control back and forth between human agents and AI systems is guaranteed. We present the anatomy of our multiadaptive human-environment interaction platform which includes explicit models of the attentional and cognitive state of the human agents as well as a dynamic model of the cyber-physical environment, and supports massive multimodality, multiscale and multiparty interaction. It is based on the principles of symmetric multimodality and bidirectional representations: all input modes are also available as output modes and vice versa, so that the system not only understands and represents the user’s multimodal input, but also its own multimodal output. We illustrate our approach with examples from advanced automotive and manufacturing applications.

Main Track

Oral Session 1: Multimodal Social Agents

Trust Me: Multimodal Signals of Trustworthiness
Gale Lucas, Giota Stratou, Shari Lieblich, and Jonathan Gratch
(University of Southern California, USA; Temple College, USA)
This paper builds on prior psychological studies that identify signals of trustworthiness between two human negotiators. Unlike prior work, the current work tracks such signals automatically and fuses them into computational models that predict trustworthiness. To achieve this goal, we apply automatic trackers to recordings of human dyads negotiating in a multi-issue bargaining task. We identify behavioral indicators in different modalities (facial expressions, gestures, gaze, and conversational features) that are predictive of trustworthiness. We predict both objective trustworthiness (i.e., are they honest) and perceived trustworthiness (i.e., do they seem honest to their interaction partner). Our experiments show that people are poor judges of objective trustworthiness (i.e., objective and perceived trustworthiness are predicted by different indicators), and that multimodal approaches better predict objective trustworthiness, whereas people overly rely on facial expressions when judging the honesty of their partner. Moreover, domain knowledge (from the literature and prior analysis of behaviors) facilitates the model development process.
Semi-situated Learning of Verbal and Nonverbal Content for Repeated Human-Robot Interaction
Iolanda Leite, André Pereira, Allison Funkhouser, Boyang Li, and Jill Fain Lehman
(Disney Research, USA)
Content authoring of verbal and nonverbal behavior is a limiting factor when developing agents for repeated social interactions with the same user. We present PIP, an agent that crowdsources its own multimodal language behavior using a method we call semi-situated learning. PIP renders segments of its goal graph into brief stories that describe future situations, sends the stories to crowd workers who author and edit a single line of character dialog and its manner of expression, integrates the results into its goal state representation, and then uses the authored lines at similar moments in conversation. We present an initial case study in which the language needed to host a trivia game interaction is learned pre-deployment and tested in an autonomous system with 200 users "in the wild." The interaction data suggests that the method generates both meaningful content and variety of expression.
Towards Building an Attentive Artificial Listener: On the Perception of Attentiveness in Audio-Visual Feedback Tokens
Catharine Oertel, José Lopes, Yu Yu, Kenneth A. Funes Mora, Joakim Gustafson, Alan W. Black, and Jean-Marc Odobez
(KTH, Sweden; EPFL, Switzerland; Idiap, Switzerland; Carnegie Mellon University, USA)
Current dialogue systems typically lack a variation of audio-visual feedback tokens. Either they do not encompass feedback tokens at all, or only support a limited set of stereotypical functions. However, this does not mirror the subtleties of spontaneous conversations. If we want to be able to build an artificial listener, as a first step towards building an empathetic artificial agent, we also need to be able to synthesize more subtle audio-visual feedback tokens. In this study, we devised an array of monomodal and multimodal binary comparison perception tests and experiments to understand how different realisations of verbal and visual feedback tokens influence third-party perception of the degree of attentiveness. This allowed us to investigate i) which features (amplitude, frequency, duration...) of the visual feedback influence attentiveness perception; ii) whether visual or verbal backchannels are perceived to be more attentive; iii) whether the fusion of unimodal tokens with low perceived attentiveness increases the degree of perceived attentiveness compared to unimodal tokens with high perceived attentiveness taken alone; iv) the automatic ranking of audio-visual feedback tokens in terms of conveyed degree of attentiveness.
Sequence-Based Multimodal Behavior Modeling for Social Agents
Soumia Dermouche and Catherine Pelachaud
(CNRS, France; Telecom ParisTech, France)
The goal of this work is to model a virtual character able to converse with different interpersonal attitudes. To build our model, we rely on the analysis of multimodal corpora of non-verbal behaviors. The interpretation of these behaviors depends on how they are sequenced (order) and distributed over time. To encompass the dynamics of non-verbal signals across both modalities and time, we make use of temporal sequence mining. Specifically, we propose a new algorithm for temporal sequence extraction. We apply our algorithm to extract temporal patterns of non-verbal behaviors expressing interpersonal attitudes from a corpus of job interviews. We demonstrate the efficiency of our algorithm in terms of significant accuracy improvement over the state-of-the-art algorithms.

Oral Session 2: Physiological and Tactile Modalities

Adaptive Review for Mobile MOOC Learning via Implicit Physiological Signal Sensing
Phuong Pham and Jingtao Wang
(University of Pittsburgh, USA)
Massive Open Online Courses (MOOCs) have the potential to enable high quality knowledge dissemination at large scale and low cost. However, today's MOOCs also suffer from low engagement, uni-directional information flow, and lack of personalization. In this paper, we propose AttentiveReview, an effective intervention technology for mobile MOOC learning. AttentiveReview infers a learner's perceived difficulty levels of the corresponding learning materials via implicit photoplethysmography (PPG) sensing on unmodified smartphones. AttentiveReview also recommends personalized review sessions through a user-independent model. In a 32-participant user study, we found that: 1) AttentiveReview significantly improved information recall (+14.6%) and learning gain (+17.4%) when compared with the no review condition; 2) AttentiveReview also achieved comparable performance in significantly less time when compared with the full review condition; 3) As an end-to-end mobile tutoring system, the benefits of AttentiveReview outweigh side-effects from false positives and false negatives. Overall, we show that it is feasible to improve mobile MOOC learning by recommending review materials adaptively from rich but noisy physiological signals.
Visuotactile Integration for Depth Perception in Augmented Reality
Nina Rosa, Wolfgang Hürst, Peter Werkhoven, and Remco Veltkamp
(Utrecht University, Netherlands)
Augmented reality applications using stereo head-mounted displays are not capable of perfectly blending real and virtual objects. For example, depth in the real world is perceived through cues such as accommodation and vergence. However, in stereo head-mounted displays these cues are disconnected since the virtual is generally projected at a static distance, while vergence changes with depth. This conflict can result in biased depth estimation of virtual objects in a real environment. In this research, we examined whether redundant tactile feedback can reduce the bias in perceived depth in a reaching task. In particular, our experiments proved that a tactile mapping of distance to vibration intensity or vibration position on the skin can be used to determine a virtual object's depth. Depth estimation when using only tactile feedback was more accurate than when using only visual feedback, and when using visuotactile feedback it was more precise and occurred faster than when using unimodal feedback. Our work demonstrates the value of multimodal feedback in augmented reality applications that require correct depth perception, and provides insights on various possible visuotactile implementations.
Exploring Multimodal Biosignal Features for Stress Detection during Indoor Mobility
Kyriaki Kalimeri and Charalampos Saitis
(ISI Foundation, Italy; TU Berlin, Germany)

This paper presents a multimodal framework for assessing the emotional and cognitive experience of blind and visually impaired people when navigating in unfamiliar indoor environments based on mobile monitoring and fusion of electroencephalography (EEG) and electrodermal activity (EDA) signals. The overall goal is to understand which environmental factors increase stress and cognitive load in order to help design emotionally intelligent mobility technologies that are able to adapt to stressful environments from real-time biosensor data. We propose a model based on a random forest classifier which successfully infers in an automatic way (weighted AUROC 79.3%) the correct environment among five predefined categories expressing generic everyday situations of varying complexity and difficulty, where different levels of stress are likely to occur. Time-locating the most predictive multimodal features that relate to cognitive load and stress, we provide further insights into the relationship of specific biomarkers with the environmental/situational factors that evoked them.

An IDE for Multimodal Controls in Smart Buildings
Sebastian Peters, Jan Ole Johanssen, and Bernd Bruegge
(TU Munich, Germany)
Smart buildings have become an essential part of our daily life. Due to the increased availability of addressable fixtures and controls, the operation of building components has become complex for occupants, especially when multimodal controls are involved. Thus, occupants are confronted with the need to manage these controls in a user-friendly way. We propose the MIBO IDE, an integrated development environment for defining, connecting, and managing multimodal controls designed with end-users in mind. The MIBO IDE provides a convenient way for occupants to create and customize multimodal interaction models while taking the occupants' preferences, culture, and potential physical limitations into account. The MIBO IDE allows occupants to apply and utilize various components such as the MIBO Editor. This core component can be used to define interaction models visually using the drag-and-drop metaphor and therefore without requiring programming skills. In addition, the MIBO IDE provides components for debugging and compiling interaction models as well as detecting conflicts among them.

Poster Session 1

Personalized Unknown Word Detection in Non-native Language Reading using Eye Gaze
Rui Hiraoka, Hiroki Tanaka, Sakriani Sakti, Graham Neubig, and Satoshi Nakamura
(Nara Institute of Science and Technology, Japan)
This paper proposes a method to detect unknown words during natural reading of non-native language text by using eye-tracking features. A previous approach utilizes gaze duration and word rarity features to perform this detection. However, while this system can be used by trained users, its performance is not sufficient during natural reading by untrained users. In this paper, we 1) apply support vector machines (SVM) with novel eye movement features that were not considered in the previous work and 2) examine the effect of personalization. The experimental results demonstrate that learning using SVMs and proposed eye movement features improves detection performance as measured by F-measure and that personalization further improves results.
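The abstract reports detection performance as an F-measure. As a reminder of how that score is obtained from per-word decisions, here is a minimal sketch in Python. The label lists are hypothetical illustration data (1 marks an unknown word), not results from the paper, and `f_measure` is a helper name introduced here.

```python
def f_measure(true_labels, predicted_labels, beta=1.0):
    """Return (precision, recall, F-beta) for binary labels, 1 = unknown word."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly detected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed unknown words
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, score


# Hypothetical per-word labels for one reader (not data from the paper).
true_labels = [1, 0, 0, 1, 1, 0, 0, 0]
predicted_labels = [1, 0, 1, 1, 0, 0, 0, 0]
precision, recall, f1 = f_measure(true_labels, predicted_labels)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

With `beta=1.0` the score is the harmonic mean of precision and recall, so a detector cannot score well by flagging every word or by flagging none.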
Discovering Facial Expressions for States of Amused, Persuaded, Informed, Sentimental and Inspired
Daniel McDuff
(Affectiva, USA; Microsoft Research, USA)
Facial expressions play a significant role in everyday interactions. A majority of the research on facial expressions of emotion has focused on a small set of "basic" states. However, in real life the expression of emotions is highly context dependent and prototypic expressions of "basic" emotions may not always be present. In this paper we attempt to discover expressions associated with alternate states of informed, inspired, persuaded, sentimental and amused based on a very large dataset of observed facial responses. We used a curated set of 395 everyday videos that were found to reliably elicit the states and recorded 49,869 facial responses as viewers watched the videos in their homes. Using automated facial coding we quantified the presence of 18 facial actions in each of the 23.4 million frames. Lip corner pulls, lip sucks and inner brow raises were prominent in sentimental responses. Outer brow raises and eye widening were prominent in persuaded and informed responses. More brow furrowing distinguished informed from persuaded responses, potentially indicating higher cognition.
Do Speech Features for Detecting Cognitive Load Depend on Specific Languages?
Rui Chen, Tiantian Xie, Yingtao Xie, Tao Lin, and Ningjiu Tang
(Sichuan University, China; University of Maryland in Baltimore County, USA)
Speech-based cognitive load modeling techniques recently proposed for English have enabled objective, quantitative and unobtrusive evaluation of cognitive load without extra equipment. However, no evidence indicates that these techniques could be applied to speech data in other languages without modification. In this study, a modified Stroop Test and a Reading Span Task were conducted to collect speech data in English and Chinese respectively, from which twenty non-linguistic features were extracted to investigate whether they were language dependent. Some discriminating speech features were found to be language dependent, which is evidence that speech-based cognitive load detection techniques need to be adapted to diverse language contexts for higher performance.
Training on the Job: Behavioral Analysis of Job Interviews in Hospitality
Skanda Muralidhar, Laurent Son Nguyen, Denise Frauendorfer, Jean-Marc Odobez, Marianne Schmid Mast, and Daniel Gatica-Perez
(Idiap, Switzerland; EPFL, Switzerland; University of Lausanne, Switzerland)

First impressions play a critical role in the hospitality industry and have been shown to be closely linked to the behavior of the person being judged. In this work, we implemented a behavioral training framework for hospitality students with the goal of improving the impressions that other people form of them. We outline the challenges associated with designing such a framework and embedding it in the everyday practice of a real hospitality school. We collected a dataset of 169 laboratory sessions where two role-plays were conducted, job interviews and reception desk scenarios, for a total of 338 interactions. For job interviews, we evaluated the relationship between automatically extracted nonverbal cues and various perceived social variables in a correlation analysis. Furthermore, our system automatically predicted first impressions from job interviews in a regression task, and was able to explain up to 32% of the variance, thus extending the results in existing literature, and showing gender differences, corroborating previous findings in psychology. This work constitutes a step towards applying social sensing technologies to the real world by designing and implementing a living lab for students of an international hospitality management school.

Emotion Spotting: Discovering Regions of Evidence in Audio-Visual Emotion Expressions
Yelin Kim and Emily Mower Provost
(SUNY Albany, USA; University of Michigan, USA)
Research has demonstrated that humans require different amounts of information, over time, to accurately perceive emotion expressions. This varies as a function of emotion classes. For example, recognition of happiness requires a longer stimulus than recognition of anger. However, previous automatic emotion recognition systems have often overlooked these differences. In this work, we propose a data-driven framework to explore patterns (timings and durations) of emotion evidence, specific to individual emotion classes. Further, we demonstrate that these patterns vary as a function of which modality (lower face, upper face, or speech) is examined, and consistent patterns emerge across different folds of experiments. We also show similar patterns across emotional corpora (IEMOCAP and MSP-IMPROV). In addition, we show that our proposed method, which uses only a portion of the data (59% for the IEMOCAP), achieves comparable accuracy to a system that uses all of the data within each utterance. Our method has a higher accuracy when compared to a baseline method that randomly chooses a portion of the data. We show that the performance gain of the method is mostly from prototypical emotion expressions (defined as expressions with rater consensus). The innovation in this study comes from its understanding of how multimodal cues reveal emotion over time.
Semi-supervised Model Personalization for Improved Detection of Learner's Emotional Engagement
Nese Alyuz, Eda Okur, Ece Oktay, Utku Genc, Sinem Aslan, Sinem Emine Mete, Bert Arnrich, and Asli Arslan Esme
(Intel, Turkey; Bogazici University, Turkey)
Affective states play a crucial role in learning. Existing Intelligent Tutoring Systems (ITSs) fail to track affective states of learners accurately. Without an accurate detection of such states, ITSs are limited in providing a truly personalized learning experience. In our longitudinal research, we have been working towards developing an empathic autonomous 'tutor' closely monitoring students in real-time using multiple sources of data to understand their affective states corresponding to emotional engagement. We focus on detecting learning related states (i.e., 'Satisfied', 'Bored', and 'Confused'). We have collected 210 hours of data through authentic classroom pilots of 17 sessions. We collected information from two modalities: (1) appearance, which is collected from the camera, and (2) context-performance, which is derived from the content platform. The learning content of the content platform consists of two section types: (1) instructional where students watch instructional videos and (2) assessment where students solve exercise questions. Since there are individual differences in expressing affective states, the detection of emotional engagement needs to be customized for each individual. In this paper, we propose a hierarchical semi-supervised model adaptation method to achieve highly accurate emotional engagement detectors. In the initial calibration phase, a personalized context-performance classifier is obtained. In the online usage phase, the appearance classifier is automatically personalized using the labels generated by the context-performance model. The experimental results show that personalization enables performance improvement of our generic emotional engagement detectors. The proposed semi-supervised hierarchical personalization method results in 89.23% and 75.20% F1 measures for the instructional and assessment sections respectively.
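The two-phase idea described above (labels from one model are used to personalize another) can be illustrated with a toy sketch. This is not the authors' hierarchical method: it swaps in a nearest-centroid appearance classifier, and the feature values, the blending `weight`, and the function names are all assumptions made for illustration.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def nearest_label(centroids, x):
    """Predict the label whose centroid is closest to x (squared distance)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], x))
    return min(centroids, key=sq_dist)


def personalize(generic_centroids, user_samples, context_labels, weight=0.5):
    """Blend each generic centroid with the centroid of the user's samples
    that the context-performance model labeled with that class."""
    personalized = dict(generic_centroids)
    for label, generic in generic_centroids.items():
        pts = [x for x, y in zip(user_samples, context_labels) if y == label]
        if pts:
            user_c = centroid(pts)
            personalized[label] = [(1 - weight) * g + weight * u
                                   for g, u in zip(generic, user_c)]
    return personalized


# Hypothetical 2-D appearance features; the generic model is trained on others.
generic = {"Satisfied": [0.0, 0.0], "Bored": [4.0, 4.0]}
# This user shows boredom more subtly than the generic model expects.
user_samples = [[1.5, 1.5], [1.7, 1.3], [0.1, -0.1]]
context_labels = ["Bored", "Bored", "Satisfied"]  # from the context model

personal = personalize(generic, user_samples, context_labels)
probe = [1.6, 1.6]
before = nearest_label(generic, probe)
after = nearest_label(personal, probe)
print(before, "->", after)
```

In this toy setting the generic classifier reads the probe as 'Satisfied', while the personalized one reads it as 'Bored', because the 'Bored' centroid has moved towards how this user actually looks when the context model says they are bored.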
Driving Maneuver Prediction using Car Sensor and Driver Physiological Signals
Nanxiang Li, Teruhisa Misu, Ashish Tawari, Alexandre Miranda, Chihiro Suga, and Kikuo Fujimura
(Honda Research Institute, USA)
This study presents a preliminary attempt to investigate the use of driver physiological signals, including electrocardiography (ECG) and respiration wave signals, to predict driving maneuvers. While most studies on driving maneuver prediction use direct measurements from the vehicle or road scene, we believe the changes in the driver's mental state when planning a maneuver can be reflected in the physiological signals. We extract both time and frequency domain features from the physiological signals, and use them as the features to predict the drivers' future maneuver. We formulate the prediction of driver maneuver as a multi-class classification problem by using the features extracted from the signal before the driving maneuvers. The classes correspond to various types of driving maneuvers including Start, Stop, Lane Switch and Turn. We use the support vector machine (SVM) as the classifier, and compare the performance of using both physiological and car signals (CAN bus) with the baseline classifier that is trained with only the car signal. Adding the physiological features improves the F-score by 0.04 on average. This improvement is more pronounced when the prediction is made earlier.
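To make "time and frequency domain features" concrete, the sketch below computes a small feature vector from one signal window using only the Python standard library. The specific features (mean, standard deviation, dominant frequency), the window length, and the synthetic signal are assumptions for illustration; the abstract does not list the features the authors used.

```python
import cmath
import math
import statistics


def time_features(window):
    """Time-domain features: mean and population standard deviation."""
    return [statistics.fmean(window), statistics.pstdev(window)]


def dominant_frequency(window, sample_rate_hz):
    """Frequency-domain feature: frequency (Hz) of the strongest DFT bin."""
    n = len(window)
    mean = statistics.fmean(window)
    centered = [x - mean for x in window]  # remove the DC offset
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):  # positive-frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate_hz / n


def feature_vector(window, sample_rate_hz):
    """Concatenate time- and frequency-domain features for one window."""
    return time_features(window) + [dominant_frequency(window, sample_rate_hz)]


# Synthetic respiration-like window: a 0.25 Hz sine sampled at 4 Hz for 16 s.
SAMPLE_RATE_HZ = 4
respiration = [math.sin(2 * math.pi * 0.25 * i / SAMPLE_RATE_HZ)
               for i in range(64)]
features = feature_vector(respiration, SAMPLE_RATE_HZ)
print(features)
```

Vectors like `features`, computed on the window preceding each maneuver and concatenated with car-signal features, would then be passed to a multi-class classifier such as an SVM; that training step is omitted here.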
Publisher's Version Article Search
On Leveraging Crowdsourced Data for Automatic Perceived Stress Detection
Jonathan Aigrain, Arnaud Dapogny, Kévin Bailly, Séverine Dubuisson, Marcin Detyniecki, and Mohamed Chetouani
(UPMC, France; CNRS, France; LIP6, France; Polish Academy of Sciences, Poland)
Resorting to crowdsourcing platforms is a popular way to obtain annotations. Multiple potentially noisy answers can thus be aggregated to retrieve an underlying ground truth. However, it may be irrelevant to look for a unique ground truth when we ask crowd workers for opinions, notably when dealing with subjective phenomena such as stress. In this paper, we discuss how we can better use crowdsourced annotations with an application to automatic detection of perceived stress. Towards this aim, we first acquired video data from 44 subjects in a stressful situation and gathered answers to a binary question using a crowdsourcing platform. Then, we propose to integrate two measures derived from the set of gathered answers into the machine learning framework. First, we highlight that using the consensus level among crowd worker answers substantially increases classification accuracies. Then, we show that it is suitable to directly predict for each video the proportion of positive answers to the question from the different crowd workers. Hence, we propose a thorough study on how crowdsourced annotations can be used to enhance performance of classification and regression methods.
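The two measures mentioned in this abstract can be illustrated with a minimal sketch. The proportion of positive answers follows directly from the text; the abstract does not define the consensus level, so the definition used here (the share of workers agreeing with the majority answer) is an assumption, and the function names are hypothetical.

```python
def positive_proportion(answers):
    """Fraction of crowd workers who answered 'yes' (True) for one video."""
    if not answers:
        raise ValueError("need at least one answer")
    return sum(1 for a in answers if a) / len(answers)


def consensus_level(answers):
    """Share of workers agreeing with the majority answer.

    Assumed definition for illustration; the paper may use a different one.
    """
    p = positive_proportion(answers)
    return max(p, 1.0 - p)


if __name__ == "__main__":
    # Hypothetical answers from eight crowd workers for a single video.
    answers = [True, True, False, True, True, False, True, True]
    print(positive_proportion(answers))  # regression target for this video
    print(consensus_level(answers))      # could weight or filter training samples
```

Under these assumptions, the proportion serves as a regression target, while the consensus level could be used to weight or select samples for classification.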
Publisher's Version Article Search
Investigating the Impact of Automated Transcripts on Non-native Speakers' Listening Comprehension
Xun Cao, Naomi Yamashita, and Toru Ishida
(Kyoto University, Japan; NTT, Japan)
Real-time transcripts generated by automatic speech recognition (ASR) technologies hold potential to facilitate non-native speakers’ (NNSs) listening comprehension. While introducing another modality (i.e., ASR transcripts) to NNSs provides supplemental information to understand speech, it also runs the risk of overwhelming them with excessive information. The aim of this paper is to understand the advantages and disadvantages of presenting ASR transcripts to NNSs and to study how such transcripts affect listening experiences. To explore these issues, we conducted a laboratory experiment with 20 NNSs who engaged in two listening tasks in different conditions: audio only and audio+ASR transcripts. In each condition, the participants described the comprehension problems they encountered while listening. From the analysis, we found that ASR transcripts helped NNSs solve certain problems (e.g., “do not recognize words they know”), but imperfect ASR transcripts (e.g., errors and no punctuation) sometimes confused them and even generated new problems. Furthermore, post-task interviews and gaze analysis of the participants revealed that NNSs did not have enough time to fully exploit the transcripts. For example, NNSs had difficulty shifting between multimodal contents. Based on our findings, we discuss the implications for designing better multimodal interfaces for NNSs.
Publisher's Version Article Search
Speaker Impact on Audience Comprehension for Academic Presentations
Keith Curtis, Gareth J. F. Jones, and Nick Campbell
(Dublin City University, Ireland; Trinity College Dublin, Ireland)
Understanding audience comprehension levels for presentations has the potential to enable richer and more focused interaction with audio-visual recordings. We describe an investigation into automated analysis and classification of audience comprehension during academic presentations. We identify audio and visual features considered to most aid audience understanding. To obtain gold standards for comprehension levels, human annotators watched contiguous video segments from a corpus of academic presentations and estimated how much they understood or comprehended the content. We investigate pre-fusion and post-fusion strategies over a number of input streams and demonstrate the most effective modalities for classification of comprehension. We demonstrate that it is possible to build a classifier to predict potential audience comprehension levels, obtaining an accuracy of 52.9% over a 7-class range and 85.4% on a binary classification problem.
Publisher's Version Article Search
EmoReact: A Multimodal Approach and Dataset for Recognizing Emotional Responses in Children
Behnaz Nojavanasghari, Tadas Baltrušaitis, Charles E. Hughes, and Louis-Philippe Morency
(University of Central Florida, USA; Carnegie Mellon University, USA)
Automatic emotion recognition plays a central role in the technologies underlying social robots, affect-sensitive human computer interaction design and affect-aware tutors. Although there has been a considerable amount of research on automatic emotion recognition in adults, emotion recognition in children has been understudied. This problem is more challenging as children tend to fidget and move around more than adults, leading to more self-occlusions and non-frontal head poses. Also, the lack of publicly available datasets for children with annotated emotion labels leads most researchers to focus on adults. In this paper, we introduce a newly collected multimodal emotion dataset of children between the ages of four and fourteen years old. The dataset contains 1102 audio-visual clips annotated for 17 different emotional states: six basic emotions, neutral, valence and nine complex emotions including curiosity, uncertainty and frustration. Our experiments compare unimodal and multimodal emotion recognition baseline models to enable future research on this topic. Finally, we present a detailed analysis of the most indicative behavioral cues for emotion recognition in children.
Publisher's Version Article Search
Bimanual Input for Multiscale Navigation with Pressure and Touch Gestures
Sebastien Pelurson and Laurence Nigay
(University of Grenoble, France; LIG, France; CNRS, France)
We explore the combination of touch modalities with pressure-based modalities for multiscale navigation in bifocal views. We investigate a two-hand mobile configuration in which: 1) the dominant hand is kept free for precise touch interaction at any scale of a bifocal view, and 2) the non-dominant hand is used for holding the device in landscape mode, keeping the thumb free for pressure input for navigation at the context scale. The pressure sensor is fixed to the front bezel. Our investigation of pressure-based modalities involves two design options: control (continuous or discrete) and inertia (with or without). The pressure-based modalities are compared to touch-only modalities: the well-known drag-flick and drag-drop modalities. The results show that the continuous pressure-based modality without inertia 1) is the fastest one, along with the drag-drop touch modality, 2) is preferred by the users, and 3) importantly, minimizes screen occlusion during a phase that requires navigating a large part of the information space.
Publisher's Version Article Search
Intervention-Free Selection using EEG and Eye Tracking
Felix Putze, Johannes Popp, Jutta Hild, Jürgen Beyerer, and Tanja Schultz
(University of Bremen, Germany; KIT, Germany; Fraunhofer IOSB, Germany)
In this paper, we show how recordings of gaze movements (via eye tracking) and brain activity (via electroencephalography) can be combined to provide an interface for implicit selection in a graphical user interface. This implicit selection works completely without manual intervention by the user. In our approach, we formulate implicit selection as a classification problem, describe the employed features and classification setup and introduce our experimental setup for collecting evaluation data. With a fully online-capable setup, we can achieve an F_0.2-score of up to 0.74 for temporal localization and a spatial localization accuracy of more than 0.95.
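For readers unfamiliar with the metric, the F_0.2-score reported above is presumably an instance of the standard F-beta measure; the abstract itself does not spell out the definition, so the formula below is the textbook one, with P denoting precision and R denoting recall.

```latex
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R},
\qquad
F_{0.2} = 1.04\,\frac{P \cdot R}{0.04\,P + R}
```

With beta below 1 the measure weights precision more heavily than recall, which would suit an implicit selection interface where false selections are costly.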
Publisher's Version Article Search
Automated Scoring of Interview Videos using Doc2Vec Multimodal Feature Extraction Paradigm
Lei Chen, Gary Feng, Chee Wee Leong, Blair Lehman, Michelle Martin-Raugh, Harrison Kell, Chong Min Lee, and Su-Youn Yoon
(Educational Testing Service, USA)
As the popularity of video-based job interviews rises, so does the need for automated tools to evaluate interview performance. Real world hiring decisions are based on assessments of knowledge and skills as well as holistic judgments of person-job fit. While previous research on automated scoring of interview videos shows promise, it lacks coverage of monologue-style responses to structured interview (SI) questions and content-focused interview rating. We report the development of a standardized video interview protocol as well as human rating rubrics focusing on verbal content, personality, and holistic judgment. A novel feature extraction method using "visual words" automatically learned from video analysis outputs and the Doc2Vec paradigm is proposed. Our promising experimental results suggest that this novel method provides effective representations for the automated scoring of interview videos.
Publisher's Version Article Search
Estimating Communication Skills using Dialogue Acts and Nonverbal Features in Multiple Discussion Datasets
Shogo Okada, Yoshihiko Ohtake, Yukiko I. Nakano, Yuki Hayashi, Hung-Hsuan Huang, Yutaka Takase, and Katsumi Nitta
(Tokyo Institute of Technology, Japan; Seikei University, Japan; Osaka Prefecture University, Japan; Ritsumeikan University, Japan)

This paper focuses on the computational analysis of the individual communication skills of participants in a group. The analysis has three novel aspects. First, we extracted features from dialogue act labels capturing how each participant communicates with the others. Second, the communication skills of each participant were assessed by 21 external raters with experience in human resource management to obtain reliable skill scores for each of the participants. Third, we used the MATRICS corpus, which includes three types of group discussion datasets, to analyze the influence of situational variability with respect to the discussion types. We developed a regression model to infer the score for communication skill using multimodal features including linguistic and nonverbal features: prosodic, speaking turn, and head activity. The experimental results show that the multimodal fusion model with feature selection achieved the best accuracy, an R² of 0.74 for the communication skill score. A feature analysis of the models revealed the task-dependent and task-independent features that contribute to the prediction performance.

Publisher's Version Article Search
Multi-Sensor Modeling of Teacher Instructional Segments in Live Classrooms
Patrick J. Donnelly, Nathaniel Blanchard, Borhan Samei, Andrew M. Olney, Xiaoyi Sun, Brooke Ward, Sean Kelly, Martin Nystrand, and Sidney K. D'Mello
(University of Notre Dame, USA; University of Memphis, USA; University of Wisconsin-Madison, USA; University of Pittsburgh, USA)
We investigate multi-sensor modeling of teachers’ instructional segments (e.g., lecture, group work) from audio recordings collected in 56 classes from eight teachers across five middle schools. Our approach fuses two sensors: a unidirectional microphone for teacher audio and a pressure zone microphone for general classroom audio. We segment and analyze the audio streams with respect to discourse timing, linguistic, and paralinguistic features. We train supervised classifiers to identify the five instructional segments that collectively comprised a majority of the data, achieving teacher-independent F1 scores ranging from 0.49 to 0.60. With respect to individual segments, the individual sensor models and the fused model were on par for Question & Answer and Procedures & Directions segments. For Supervised Seatwork, Small Group Work, and Lecture segments, the classroom model outperformed both the teacher and fusion models. Across all segments, a multi-sensor approach led to an average 8% improvement over the state of the art approach that only analyzed teacher audio. We discuss implications of our findings for the emerging field of multimodal learning analytics.
Publisher's Version Article Search

Oral Session 3: Groups, Teams, and Meetings

Meeting Extracts for Discussion Summarization Based on Multimodal Nonverbal Information
Fumio Nihei, Yukiko I. Nakano, and Yutaka Takase
(Seikei University, Japan)
Group discussions are used for various purposes, such as creating new ideas and making a group decision. It is desirable to archive the results and processes of the discussion as useful resources for the group. Therefore, a key technology would be a way to extract meaningful resources from a group discussion. To accomplish this goal, we propose classification models that select meeting extracts to be included in the discussion summary based on nonverbal behavior such as attention, head motion, prosodic features, and co-occurrence patterns of these behaviors. We create different prediction models depending on the degree of extract-worthiness, which is assessed by the agreement ratio among human judgments. Our best model achieves 0.707 in F-measure and 0.75 in recall rate, and can compress a discussion into 45% of its original duration. The proposed models reveal that nonverbal information is indispensable for selecting meeting extracts of a group discussion. One of the future directions is to implement the models as an automatic meeting summarization system.
Publisher's Version Article Search
Getting to Know You: A Multimodal Investigation of Team Behavior and Resilience to Stress
Catherine Neubauer, Joshua Woolley, Peter Khooshabeh, and Stefan Scherer
(University of Southern California, USA; University of California at San Francisco, USA; Army Research Lab at Los Angeles, USA)
Team cohesion has been suggested to be a critical factor in emotional resilience following periods of stress. Team cohesion may depend on several factors including emotional state, communication among team members and even psychophysiological response. The present study sought to employ several multimodal techniques designed to investigate team behavior as a means of understanding resilience to stress. We recruited 40 subjects to perform a cooperative-task in gender-matched, two-person teams. They were responsible for working together to meet a common goal, which was to successfully disarm a simulated bomb. This high-workload task requires successful cooperation and communication among members. We assessed several behaviors that relate to facial expression, word choice and physiological responses (i.e., heart rate variability) within this scenario. A manipulation of an “ice breaker” condition was used to induce a level of comfort or familiarity within the team prior to the task. We found that individuals in the “ice breaker” condition exhibited better resilience to subjective stress following the task. These individuals also exhibited more insight and cognitive speech, more positive facial expressions and were also able to better regulate their emotional expression during the task, compared to the control.
Publisher's Version Article Search
Measuring the Impact of Multimodal Behavioural Feedback Loops on Social Interactions
Ionut Damian, Tobias Baur, and Elisabeth André
(University of Augsburg, Germany)
In this paper we explore the concept of automatic behavioural feedback loops during social interactions. Behavioural feedback loops (BFL) are rapid processes which analyse the behaviour of the user in real time and provide the user with live feedback on how to improve the behaviour quality. In this context, we implemented an open source software framework for designing, creating and executing BFL on Android powered mobile devices. To get a better understanding of the effects of BFL on face-to-face social interactions, we conducted a user study and compared four different BFL types spanning three modalities: tactile, auditory and visual. For the study, the BFL were designed to improve the users' perception of their speaking time in an effort to create more balanced group discussions. The study yielded valuable insights into the impact of BFL on conversations and how humans react to such systems.
Publisher's Version Article Search
Analyzing Mouth-Opening Transition Pattern for Predicting Next Speaker in Multi-party Meetings
Ryo Ishii, Shiro Kumano, and Kazuhiro Otsuka
(NTT, Japan)
Techniques that use nonverbal behaviors to predict turn-changing situations—e.g., predicting who will speak next and when, in multi-party meetings—have been receiving a lot of attention in recent research. In this research, we explored the transition pattern of the degree of mouth opening (MOTP) towards the end of an utterance to predict the next speaker in multiparty meetings. First, we collected data on utterances and on the degree of mouth opening (closed, slightly open, and wide open) from participants in four-person meetings. The representative results of the analysis of the MOTPs showed that the speaker often continues to open the mouth slightly in turn-keeping and starts to close the mouth from opening it slightly or continues to open the mouth largely in turn-changing. The next speaker often starts to open the mouth slightly from closing it in turn-changing. On the basis of these findings, we constructed next-speaker prediction models using the MOTPs. In addition, as a multimodal fusion, we constructed models using the MOTPs and gaze information, which is known to be one of the most useful types of information for the next-speaker prediction. The evaluation of the models suggests that the speaker's and listeners' MOTPs are effective for predicting the next speaker in multi-party meetings. It also suggests the multimodal fusion using the MOTP and gaze information is more useful for the prediction than using one or the other.
Publisher's Version Article Search

Oral Session 4: Personality and Emotion

Automatic Recognition of Self-Reported and Perceived Emotion: Does Joint Modeling Help?
Biqiao Zhang, Georg Essl, and Emily Mower Provost
(University of Michigan, USA; University of Wisconsin-Milwaukee, USA)
Emotion labeling is a central component of automatic emotion recognition. Evaluators are asked to estimate the emotion label given a set of cues, produced either by themselves (self-report label) or others (perceived label). This process is complicated by the mismatch between the intentions of the producer and the interpretation of the perceiver. Traditionally, emotion recognition systems use only one of these types of labels when estimating the emotion content of data. In this paper, we explore the impact of jointly modeling both an individual's self-report and the perceived label of others. We use deep belief networks (DBN) to learn a representative feature space, and model the potentially complementary relationship between intention and perception using multi-task learning. We hypothesize that the use of DBN feature-learning and multi-task learning of self-report and perceived emotion labels will improve the performance of emotion recognition systems. We test this hypothesis on the IEMOCAP dataset, an audio-visual and motion-capture emotion corpus. We show that both DBN feature learning and multi-task learning offer complementary gains. The results demonstrate that the perceived emotion tasks see the greatest performance gain for emotionally subtle utterances, while the self-report emotion tasks see the greatest performance gain for emotionally clear utterances. Our results suggest that the combination of knowledge from the self-report and perceived emotion labels leads to more effective emotion recognition systems.
Publisher's Version Article Search
Personality Classification and Behaviour Interpretation: An Approach Based on Feature Categories
Sheng Fang, Catherine Achard, and Séverine Dubuisson
(UPMC, France)
This paper focuses on recognizing and understanding social dimensions (the personality traits and social impressions) during small group interactions. We extract a set of audio and visual features, which are divided into three categories: intra-personal features (i.e. related to only one participant), dyadic features (i.e. related to a pair of participants) and one vs all features (i.e. related to one participant versus the other members of the group). First, we predict the personality traits (PT) and social impressions (SI) by using these three feature categories. Then, we analyse the interplay between groups of features and the personality traits/social impressions of the interacting participants. The prediction is done by using Support Vector Machine and Ridge Regression, which allows us to determine the most dominant features for each social dimension. Our experiments show that the combination of intra-personal and one vs all features can greatly improve the prediction accuracy of personality traits and social impressions. Prediction accuracy reaches 81.37% for the social impression named 'Rank of Dominance'. Finally, we draw some interesting conclusions about the relationship between personality traits/social impressions and social features.
Publisher's Version Article Search
Multiscale Kernel Locally Penalised Discriminant Analysis Exemplified by Emotion Recognition in Speech
Xinzhou Xu, Jun Deng, Maryna Gavryukova, Zixing Zhang, Li Zhao, and Björn Schuller
(Southeast University, China; TU Munich, Germany; University of Passau, Germany; Imperial College London, UK)
We propose a novel method to learn multiscale kernels with locally penalised discriminant analysis, namely Multiscale-Kernel Locally Penalised Discriminant Analysis (MS-KLPDA). As an exemplary use-case, we apply it to recognise emotions in speech. Specifically, we employ the term of locally penalised discriminant analysis by controlling the weights of marginal sample pairs, while the method learns kernels with multiple scales. Evaluated in a series of experiments on emotional speech corpora, our proposed MS-KLPDA is able to outperform the previous research of Multiscale-Kernel Fisher Discriminant Analysis and some conventional methods in solving speech emotion recognition.
Publisher's Version Article Search
Estimating Self-Assessed Personality from Body Movements and Proximity in Crowded Mingling Scenarios
Laura Cabrera-Quiros, Ekin Gedik, and Hayley Hung
(Delft University of Technology, Netherlands; Instituto Tecnológico de Costa Rica, Costa Rica)
This paper focuses on the automatic classification of self-assessed personality traits from the HEXACO inventory during crowded mingle scenarios. We exploit acceleration and proximity data from a wearable device hung around the neck. Unlike most state-of-the-art studies, addressing personality estimation during mingle scenarios provides a challenging social context as people interact dynamically and freely in a face-to-face setting. While many former studies use audio to extract speech-related features, we present a novel method of extracting an individual’s speaking status from a single body worn triaxial accelerometer which scales easily to large populations. Moreover, by fusing both speech and movement energy related cues from just acceleration, our experimental results show improvements on the estimation of Humility over features extracted from a single behavioral modality. We validated our method on 71 participants where we obtained an accuracy of 69% for Honesty, Conscientiousness and Openness to Experience. To our knowledge, this is the largest validation of personality estimation carried out in such a social context with simple wearable sensors.
Publisher's Version Article Search

Poster Session 2

Deep Learning Driven Hypergraph Representation for Image-Based Emotion Recognition
Yuchi Huang and Hanqing Lu
(Chinese Academy of Sciences, China)
In this paper, we proposed a bi-stage framework for image-based emotion recognition by combining the advantages of deep convolutional neural networks (D-CNN) and hypergraphs. To exploit the representational power of D-CNN, we remodeled its last hidden feature layer as the 'attribute' layer in which each hidden unit produces probabilities on a specific semantic attribute. To describe the high-order relationship among facial images, each face was assigned to various hyperedges according to the computed probabilities on different D-CNN attributes. In this way, we tackled the emotion prediction problem by a transductive learning approach, which tends to assign the same label to faces that share many incidental hyperedges (attributes), with the constraints that predicted labels of training samples should be similar to their ground truth labels. We compared the proposed approach to state-of-the-art methods and its effectiveness was demonstrated by extensive experimentation.
Publisher's Version Article Search
Towards a Listening Agent: A System Generating Audiovisual Laughs and Smiles to Show Interest
Kevin El Haddad, Hüseyin Çakmak, Emer Gilmartin, Stéphane Dupont, and Thierry Dutoit
(University of Mons, Belgium; Trinity College Dublin, Ireland)
In this work, we experiment with the use of smiling and laughter in order to help create more natural and efficient listening agents. We present preliminary results on a system which predicts smile and laughter sequences in one dialogue participant based on observations of the other participant's behavior. This system also predicts the level of intensity or arousal for these sequences. We also describe an audiovisual concatenative synthesis process used to generate laughter and smiling sequences, producing multilevel amusement expressions from a dataset of audiovisual laughs. We thus present two contributions: one in the generation of smiling and laughter responses, the other in the prediction of what laughter and smiles to use in response to an interlocutor's behaviour. Both the synthesis system and the prediction system have been evaluated via Mean Opinion Score tests and have proved to give satisfying and promising results which open the door to interesting perspectives.
Publisher's Version Article Search
Sound Emblems for Affective Multimodal Output of a Robotic Tutor: A Perception Study
Helen Hastie, Pasquale Dente, Dennis Küster, and Arvid Kappas
(Heriot-Watt University, UK; Jacobs University, Germany)
Human and robot tutors alike have to give careful consideration as to how feedback is delivered to students to provide a motivating yet clear learning context. Here, we performed a perception study to investigate attitudes towards negative and positive robot feedback in terms of perceived emotional valence on the dimensions of 'Pleasantness', 'Politeness' and 'Naturalness'. We find that, indeed, negative feedback is perceived as significantly less polite and pleasant. Unlike humans who have the capacity to leverage various paralinguistic cues to convey subtle variations of meaning and emotional climate, presently robots are much less expressive. However, they have one advantage that they can combine synthetic robotic sound emblems with verbal feedback. We investigate whether these sound emblems, and their position in the utterance, can be used to modify the perceived emotional valence of the robot feedback. We discuss this in the context of an adaptive robotic tutor interacting with students in a multimodal learning environment.
Publisher's Version Article Search
Automatic Detection of Very Early Stage of Dementia through Multimodal Interaction with Computer Avatars
Hiroki Tanaka, Hiroyoshi Adachi, Norimichi Ukita, Takashi Kudo, and Satoshi Nakamura
(Nara Institute of Science and Technology, Japan; Osaka University Health Care Center, Japan)
This paper proposes a new approach to automatically detecting the very early stage of dementia. We develop a computer avatar with spoken dialog functionalities that produces natural spoken queries referring to the Mini Mental State Examination, the Wechsler Memory Scale-Revised and other related questions. Multimodal interactive data of spoken dialogues from 18 participants (9 with dementia and 9 healthy controls) are recorded, and audiovisual features are extracted. We confirm that support vector machines can classify the participants into the two groups with a detection performance of 0.94, measured as the area under the ROC curve. We find that our system has the potential to detect the very early stage of dementia through spoken dialog with our computer avatars.
Publisher's Version Article Search
MobileSSI: Asynchronous Fusion for Social Signal Interpretation in the Wild
Simon Flutura, Johannes Wagner, Florian Lingenfelser, Andreas Seiderer, and Elisabeth André
(University of Augsburg, Germany)
Over the last years, mobile devices have become an integral part of people's everyday life. At the same time, they provide more and more computational power and memory capacity to perform complex calculations that formerly could only be accomplished with bulky desktop machines. These capabilities combined with the willingness of people to permanently carry them around open up completely new perspectives to the area of Social Signal Processing. To allow for an immediate analysis and interaction, real-time assessment is necessary. To exploit the benefits of multiple sensors, fusion algorithms are required that are able to cope with data loss in asynchronous data streams. In this paper we present MobileSSI, a port of the Social Signal Interpretation (SSI) framework to Android and embedded Linux platforms. We will test to what extent it is possible to run sophisticated synchronization and fusion mechanisms in an everyday mobile setting and compare the results with similar tasks in a laboratory environment.
Publisher's Version Article Search Info
Language Proficiency Assessment of English L2 Speakers Based on Joint Analysis of Prosody and Native Language
Yue Zhang, Felix Weninger, Anton Batliner, Florian Hönig, and Björn Schuller
(Imperial College London, UK; Nuance Communications, Germany; University of Passau, Germany; University of Erlangen-Nuremberg, Germany)
In this work, we present an in-depth analysis of the interdependency between the non-native prosody and the native language (L1) of English L2 speakers, as separately investigated in the Degree of Nativeness Task and the Native Language Task of the INTERSPEECH 2015 and 2016 Computational Paralinguistics ChallengE (ComParE). To this end, we propose a multi-task learning scheme based on auxiliary attributes for jointly learning the tasks of L1 classification and prosody score regression. The effectiveness of this scheme is demonstrated in extensive experimental runs, comparing various standardised feature sets of prosodic, cepstral, spectral, and voice quality descriptors, as well as automatic feature selection. As a result, we show that the prediction of both prosody score and L1 can be improved by considering both tasks in a holistic way. In particular, we achieve an 11% relative gain in regression performance (Spearman's correlation coefficient) on prosody scores, when comparing the best multi- and single-task learning results.
Publisher's Version Article Search
Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution
Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang
(Microsoft Research, USA)
Crowd sourcing has become a widely adopted scheme to collect ground truth labels. However, it is a well-known problem that these labels can be very noisy. In this paper, we demonstrate how to learn a deep convolutional neural network (DCNN) from noisy labels, using facial expression recognition as an example. More specifically, we have 10 taggers to label each input image, and compare four different approaches to utilizing the multiple labels: majority voting, multi-label learning, probabilistic label drawing, and cross-entropy loss. We show that the traditional majority voting scheme does not perform as well as the last two approaches that fully leverage the label distribution. An enhanced FER+ data set with multiple labels for each face image will also be shared with the research community.
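As a rough illustration of how several tagger votes per image can become training targets, the sketch below builds targets for three of the four schemes named in this abstract. It is a sketch under stated assumptions, not the authors' implementation: all function names and the example votes are hypothetical, and multi-label learning is omitted because the abstract does not describe its rule.

```python
import math
import random


def label_distribution(votes, num_classes):
    """Normalize tagger votes into a probability distribution over classes."""
    counts = [0] * num_classes
    for v in votes:
        counts[v] += 1
    return [c / len(votes) for c in counts]


def majority_target(votes, num_classes):
    """One-hot target on the most-voted class (ties go to the lowest index)."""
    dist = label_distribution(votes, num_classes)
    winner = max(range(num_classes), key=lambda k: dist[k])
    return [1.0 if k == winner else 0.0 for k in range(num_classes)]


def drawn_target(votes, num_classes, rng):
    """One-hot target on a class sampled from the vote distribution."""
    dist = label_distribution(votes, num_classes)
    drawn = rng.choices(range(num_classes), weights=dist, k=1)[0]
    return [1.0 if k == drawn else 0.0 for k in range(num_classes)]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


if __name__ == "__main__":
    votes = [0, 0, 0, 1, 1, 2, 0, 0, 1, 0]  # ten hypothetical taggers, three classes
    predicted = [0.5, 0.4, 0.1]             # hypothetical network output
    print(cross_entropy(majority_target(votes, 3), predicted))
    print(cross_entropy(label_distribution(votes, 3), predicted))
    print(cross_entropy(drawn_target(votes, 3, random.Random(0)), predicted))
```

The majority target discards the minority votes, whereas the distribution target and the sampled target both retain information about tagger disagreement.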
Publisher's Version Article Search
Deep Multimodal Fusion for Persuasiveness Prediction
Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency
(University of Central Florida, USA; Carnegie Mellon University, USA)
Persuasiveness is a high-level personality trait that quantifies the influence a speaker has on the beliefs, attitudes, intentions, motivations, and behavior of the audience. With social multimedia becoming an important channel in propagating ideas and opinions, analyzing persuasiveness is very important. In this work, we use the publicly available Persuasive Opinion Multimedia (POM) dataset to study persuasion. One of the challenges associated with this problem is the limited amount of annotated data. To tackle this challenge, we present a deep multimodal fusion architecture which is able to leverage complementary information from individual modalities for predicting persuasiveness. Our methods show significant improvement in performance over previous approaches.
Publisher's Version Article Search
Comparison of Three Implementations of HeadTurn: A Multimodal Interaction Technique with Gaze and Head Turns
Oleg Špakov, Poika Isokoski, Jari Kangas, Jussi Rantala, Deepak Akkil, and Roope Raisamo
(University of Tampere, Finland)
The best way to construct user interfaces for smart glasses is not yet known. We investigated the use of eye tracking in this context in two experiments. Eye and head movements were combined so that one can select the object to interact with by looking at it and then change a setting in that object by turning the head horizontally. We compared three different techniques for mapping the head turn to scrolling a list of numbers, with and without haptic feedback. We found that the haptic feedback had no noticeable effect on objective metrics, but it sometimes improved user experience. Direct mapping of head orientation to list position is fast and easy to understand, but the signal-to-noise ratio of eye and head position measurement limits the possible range. The technique with a constant rate of change after crossing the head angle threshold was simple and functional, but slow when the rate of change is adjusted to suit beginners. Finally, the technique with a rate of change dependent on the head angle tends to lead to fairly long task completion times, although in theory it offers a good combination of speed and accuracy.
Publisher's Version Article Search
Effects of Multimodal Cues on Children's Perception of Uncanniness in a Social Robot
Maike Paetzel, Christopher Peters, Ingela Nyström, and Ginevra Castellano
(Uppsala University, Sweden; KTH, Sweden)
This paper investigates the influence of multimodal incongruent gender cues on the perception of a robot's uncanniness and gender in children. The back-projected robot head Furhat was equipped with a female and male face texture and voice synthesizer, and the voice and facial cues were tested in congruent and incongruent combinations. 106 children between the ages of 8 and 13 participated in the study. Results show that multimodal incongruent cues do not trigger the feeling of uncanniness in children. These results are significant as they support other recent research showing that the perception of uncanniness cannot be triggered by a categorical ambiguity in the robot. In addition, we found that children rely on auditory cues much more strongly than on facial cues when assigning a gender to the robot if presented with incongruent cues. These findings have implications for robot design, as it seems possible to change the gender of a robot by only changing its voice without creating a feeling of uncanniness in a child.
Publisher's Version Article Search
Multimodal Feedback for Finger-Based Interaction in Mobile Augmented Reality
Wolfgang Hürst and Kevin Vriens
(Utrecht University, Netherlands; TWNKLS, Netherlands)
Mobile or handheld augmented reality uses a smartphone's live video stream and enriches it with superimposed graphics. In such scenarios, tracking one's fingers in front of the camera and interpreting these traces as gestures offers interesting perspectives for interaction. Yet, the lack of haptic feedback provides challenges that need to be overcome. We present a pilot study where three types of feedback (audio, visual, haptic) and combinations thereof are used to support basic finger-based gestures (grab, release). A comparative study with 26 subjects shows an advantage in providing combined, multimodal feedback. In addition, it suggests high potential of haptic feedback via phone vibration, which is surprising given the fact that it is held with the other, non-interacting hand.
Publisher's Version Article Search
Smooth Eye Movement Interaction using EOG Glasses
Murtaza Dhuliawala, Juyoung Lee, Junichi Shimizu, Andreas Bulling, Kai Kunze, Thad Starner, and Woontack Woo
(Georgia Institute of Technology, USA; KAIST, South Korea; Keio University, Japan; Max Planck Institute for Informatics, Germany)
Orbits combines a visual display and an eye motion sensor to allow a user to select between options by tracking a cursor with the eyes as the cursor travels in a circular path around each option. Using an off-the-shelf Jins MEME pair of eyeglasses, we present a pilot study that suggests that the eye movement required for Orbits can be sensed using three electrodes: one in the nose bridge and one in each nose pad. For forced choice binary selection, we achieve a 2.6 bits per second (bps) input rate at 250ms per input. We also introduce Head Orbits, where the user fixates the eyes on a target and moves the head in synchrony with the orbiting target. Measuring only the relative movement of the eyes in relation to the head, this method achieves a maximum rate of 2.0 bps at 500ms per input. Finally, we combine the two techniques together with a gyro to create an interface with a maximum input rate of 5.0 bps.
Publisher's Version Article Search
Active Speaker Detection with Audio-Visual Co-training
Punarjay Chakravarty, Jeroen Zegers, Tinne Tuytelaars, and Hugo Van hamme
(KU Leuven, Belgium; iMinds, Belgium)
In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision - audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
Publisher's Version Article Search Video Artifacts Available
Detecting Emergent Leader in a Meeting Environment using Nonverbal Visual Features Only
Cigdem Beyan, Nicolò Carissimi, Francesca Capozzi, Sebastiano Vascon, Matteo Bustreo, Antonio Pierro, Cristina Becchio, and Vittorio Murino
(IIT Genoa, Italy; McGill University, Canada; University of Venice, Italy; Sapienza University of Rome, Italy; University of Turin, Italy)
In this paper, we propose an effective method for emergent leader detection in meeting environments which is based on nonverbal visual features. Identifying an emergent leader is an important issue for organizations. It is also a well-investigated topic in social psychology, while it is a relatively new problem in social signal processing (SSP). The effectiveness of nonverbal features has been shown by many previous SSP studies. In general, nonverbal video-based features were not more effective than audio-based features, although their fusion generally improved the overall performance. However, in the absence of audio sensors, accurate detection of social interactions is still crucial. Motivated by this, we propose novel, automatically extracted nonverbal features to identify emergent leadership. The extracted nonverbal features are based on automatically estimated visual focus of attention, which in turn is based on head pose. The proposed method and the defined features were evaluated on a new dataset, which is introduced for the first time in this paper, including its design, collection, and annotation. The effectiveness of the features and the method was also compared with many state-of-the-art features and methods.
Publisher's Version Article Search
Stressful First Impressions in Job Interviews
Ailbhe N. Finnerty, Skanda Muralidhar, Laurent Son Nguyen, Fabio Pianesi, and Daniel Gatica-Perez
(Fondazione Bruno Kessler, Italy; CIMeC, Italy; Idiap, Switzerland; EPFL, Switzerland; EIT Digital, Italy)
Stress can impact many aspects of our lives, such as the way we interact and work with others, or the first impressions that we make. In the past, stress has been most commonly assessed through self-reported questionnaires; however, advancements in wearable technology have enabled the measurement of physiological symptoms of stress in an unobtrusive manner. Using a dataset of job interviews, we investigate whether first impressions of stress (from annotations) are equivalent to physiological measurements of the electrodermal activity (EDA). We examine the use of automatically extracted nonverbal cues stemming from both the visual and audio modalities, as well as EDA stress measurements, for the inference of stress impressions obtained from manual annotations. Stress impressions were found to be significantly negatively correlated with hireability ratings, i.e., individuals who were perceived to be more stressed were more likely to obtain lower hireability scores. The analysis revealed a significant relationship between audio and visual features but low predictability, and no significant effects were found for the EDA features. While some nonverbal cues were more clearly related to stress, the physiological cues were less reliable and warrant further investigation into the use of wearable sensors for stress detection.
Publisher's Version Article Search

Oral Session 5: Gesture, Touch, and Haptics

Analyzing the Articulation Features of Children's Touchscreen Gestures
Alex Shaw and Lisa Anthony
(University of Florida, USA)
Children’s touchscreen interaction patterns are generally quite different from those of adults. In particular, it has been established that children’s gestures are recognized by existing algorithms with much lower accuracy than are adults’ gestures. Previous work has qualitatively and quantitatively analyzed adults’ gestures to promote improved recognition, but this has not been done for children’s gestures in the same systematic manner. We present an analysis of gestures elicited from 24 children (age 5 to 10 years old) and 27 adults in which we calculate geometric, kinematic, and relative articulation features of the gestures. We examine the effect of user age on 22 different gesture features to better understand how children’s gesturing abilities and behaviors differ between various age groups, and from adults. We discuss the implications of our findings and how they will contribute to creating new gesture recognition algorithms tailored specifically for children.
Publisher's Version Article Search
Reach Out and Touch Me: Effects of Four Distinct Haptic Technologies on Affective Touch in Virtual Reality
Imtiaj Ahmed, Ville Harjunen, Giulio Jacucci, Eve Hoggan, Niklas Ravaja, and Michiel M. Spapé
(Aalto University, Finland; University of Helsinki, Finland; Liverpool Hope University, UK)
Virtual reality presents an extraordinary platform for multimodal communication. Haptic technologies have been shown to provide an important contribution to this by facilitating co-presence and allowing affective communication. However, the findings of the affective influences rely on studies that have used myriad different types of haptic technology, making it likely that some forms of tactile feedback are more efficient in communicating emotions than others. To find out whether this is true and which haptic technologies are most effective, we measured user experience during a communication scenario featuring an affective agent and interpersonal touch in virtual reality. Interpersonal touch was simulated using two types of vibrotactile actuators and two types of force feedback mechanisms. Self-reports of subjective experience of the agent’s touch and emotions were obtained. The results revealed that, regardless of the agent’s expression, force feedback actuators were rated as more natural and resulted in greater emotional interdependence and a stronger sense of co-presence than vibrotactile touch.
Publisher's Version Article Search
Using Touchscreen Interaction Data to Predict Cognitive Workload
Philipp Mock, Peter Gerjets, Maike Tibus, Ulrich Trautwein, Korbinian Möller, and Wolfgang Rosenstiel
(Leibniz-Institut für Wissensmedien, Germany; University of Tübingen, Germany)
Although a great number of today’s learning applications run on devices with an interactive screen, the high-resolution interaction data which these devices provide have not been used for workload-adaptive systems yet. This paper aims at exploring the potential of using touch sensor data to predict user states in learning scenarios. For this purpose, we collected touch interaction patterns of children solving math tasks on a multi-touch device. 30 fourth-grade students from a primary school participated in the study. Based on these data, we investigate how machine learning methods can be applied to predict cognitive workload associated with tasks of varying difficulty. Our results show that interaction patterns from a touchscreen can be used to significantly improve automatic prediction of high levels of cognitive workload (average classification accuracy of 90.67% between the easiest and most difficult tasks). Analyzing an extensive set of features, we discuss which characteristics are most likely to be of high value for future implementations. We furthermore elaborate on design choices made for the used multiple choice interface and discuss critical factors that should be considered for future touch-based learning interfaces.
Publisher's Version Article Search
Exploration of Virtual Environments on Tablet: Comparison between Tactile and Tangible Interaction Techniques
Adrien Arnaud, Jean-Baptiste Corrégé, Céline Clavel, Michèle Gouiffès, and Mehdi Ammi
(CNRS, France; University of Paris-Saclay, France)
This paper presents a preliminary study which aims to investigate tangible navigation in 3D virtual environments with new tablets that provide self-contained full 3D tracking (e.g., Google Tango, Intel RealSense). The tangible navigation was compared with tactile navigation techniques used in standard 3D applications (e.g., video games, CAD, data visualization). Four conditions were compared: classic multi-touch interaction, tactile interaction with sticks, tangible interaction with a 1:2.5 scale factor, and tangible interaction with a 1:5 scale factor. The study focuses on the subjective evaluation users make of the different interaction techniques in terms of usability and acceptability, and on the fidelity of representation of the explored scene. Participants were asked to explore a virtual apartment using one of the four interaction techniques. Then, they were requested to draw a 2D sketch map of the explored apartment and to complete a questionnaire on usability and acceptability. To take the investigation further, the participants' activity was also recorded. First results show that exploring a virtual environment using only the device's movements is more usable and acceptable compared to the tactile interaction techniques, but the difference seems to become smaller as the scale factor increases.
Publisher's Version Article Search

Oral Session 6: Skill Training and Assessment

Understanding the Impact of Personal Feedback on Face-to-Face Interactions in the Workplace
Afra Mashhadi, Akhil Mathur, Marc Van den Broeck, Geert Vanderhulst, and Fahim Kawsar
(Nokia Bell Labs, Ireland; Nokia Bell Labs, Belgium)
Face-to-face interactions have proven to accelerate team and larger organisation success. Much past research has explored the benefits of quantifying face-to-face interactions for informed workplace management; however, to date, little attention has been paid to understanding how feedback on interaction behaviour is perceived at a personal scale. In this paper, we offer a reflection on the automated feedback of personal interactions in a workplace through a longitudinal study. We designed and developed a mobile system that captured, modelled, quantified and visualised face-to-face interactions of 47 employees for 4 months in an industrial research lab in Europe. Then we conducted semi-structured interviews with 20 employees to understand their perception and experience with the system. Our findings suggest that the short-term feedback on personal face-to-face interactions was not perceived as an effective external cue to promote self-reflection and that employees desire long-term feedback annotated with actionable attributes. Our findings provide a set of implications for the designers of future workplace technology and also open up avenues for future HCI research on promoting self-reflection among employees.
Publisher's Version Article Search
Asynchronous Video Interviews vs. Face-to-Face Interviews For Communication Skill Measurement: A Systematic Study
Sowmya Rasipuram, Pooja Rao S. B., and Dinesh Babu Jayagopi
(IIIT Bangalore, India)
Communication skill is an important social variable in employment interviews. As recent trends indicate, asynchronous or interface-based video interviews are becoming increasingly popular. Automatic hiring analysis is also attracting increasing interest, and automatic communication skill prediction is one such task. In this context, the research gap that our paper addresses is: “Are there any differences in the perception of communication skill and the accuracy of automatic prediction of, say, classes of communicators (e.g., those below average) when we compare interface-based and face-to-face interviews?” To this end, we have collected a set of 106 interview videos from graduate students in both settings, i.e., interface-based and face-to-face. We observe that the perception of participants' behavior in interface-based interviews (when no person is involved) vs. face-to-face interviews (when an interviewer is involved), according to external naive observers, is slightly different. In this paper, we present an automatic system to predict the communication skill of a person in interface-based and face-to-face interviews by automatically extracting several low-level features based on the audio, visual, and lexical behavior of the participants and using machine learning algorithms such as Linear Regression, Support Vector Machine (SVM), and Logistic Regression. We also make an extensive study of the verbal behavior of the participant when the spoken response is obtained from manual transcriptions and from an Automatic Speech Recognition (ASR) tool. Our best automatic prediction results achieve an accuracy of 80% in the interface-based and 83% in the face-to-face setting.
Publisher's Version Article Search
Context and Cognitive State Triggered Interventions for Mobile MOOC Learning
Xiang Xiao and Jingtao Wang
(University of Pittsburgh, USA)
We present Context and Cognitive State triggered Feed-Forward (C2F2), an intelligent tutoring system and algorithm, to improve both student engagement and learning efficacy in mobile Massive Open Online Courses (MOOCs). C2F2 infers and responds to learners' boredom and disengagement events in real time via a combination of camera-based photoplethysmography (PPG) sensing and learning topic importance monitoring. It proactively reminds a learner of upcoming important content (feed-forward interventions) when disengagement is detected. C2F2 runs on unmodified smartphones and is compatible with courses offered by major MOOC providers. In a 48-participant user study, we found that C2F2 on average improved learning gains by 20.2% when compared with a baseline system without the feed-forward intervention. C2F2 was especially effective for the bottom performers and improved their learning gains by 41.6%. This study demonstrates the feasibility and potential of using the PPG signals implicitly recorded by the built-in camera of smartphones to facilitate mobile MOOC learning.
Publisher's Version Article Search
Native vs. Non-native Language Fluency Implications on Multimodal Interaction for Interpersonal Skills Training
Mathieu Chollet, Helmut Prendinger, and Stefan Scherer
(University of Southern California, USA; National Institute of Informatics, Japan)
New technological developments in the field of multimodal interaction show great promise for the improvement and assessment of public speaking skills. However, it is unclear how the experience of non-native speakers interacting with such technologies differs from that of native speakers. In particular, non-native speakers could benefit less from training with multimodal systems compared to native speakers. Additionally, machine learning models trained for the automatic assessment of public speaking ability on data of native speakers might not perform well when assessing the performance of non-native speakers. In this paper, we investigate two aspects related to the performance and evaluation of multimodal interaction technologies designed for the improvement and assessment of public speaking between a population of English native speakers and a population of non-native English speakers. Firstly, we compare the experiences and training outcomes of these two populations interacting with a virtual audience system designed for training public speaking ability, collecting a dataset of public speaking presentations in the process. Secondly, using this dataset, we build regression models for predicting public speaking performance on both populations and evaluate these models, both on the population they were trained on and on how they generalize to the second population.
Publisher's Version Article Search


Demo Session 1

Social Signal Processing for Dummies
Ionut Damian, Michael Dietz, Frank Gaibler, and Elisabeth André
(University of Augsburg, Germany)
We introduce SSJ Creator, a modern Android GUI enabling users to design and execute social signal processing pipelines using nothing but their smartphones and without writing a single line of code. It is based on a modular Java-based social signal processing framework (SSJ), which is able to perform realtime multimodal behaviour analysis on Android devices using both device internal and external sensors.
Publisher's Version Article Search Info
Metering "Black Holes": Networking Stand-Alone Applications for Distributed Multimodal Synchronization
Michael Cohen, Yousuke Nagayama, and Bektur Ryskeldiev
(University of Aizu, Japan)

We have developed a phantom GUI emulator that can read from otherwise stand-alone applications, complementing a separate parallel program that can write to such applications. In conjunction with the “Alice” desktop VR system and the previously developed “Collaborative Virtual Environment,” both of which are freely available, virtual scene exploration can synchronize with multimodal peers, including panoramic browsers, spatial sound renderers, and smartphone and tablet interfaces.

Publisher's Version Article Search
Towards a Multimodal Adaptive Lighting System for Visually Impaired Children
Euan Freeman, Graham Wilson, and Stephen Brewster
(University of Glasgow, UK)
Visually impaired children often have difficulty with everyday activities like locating items, e.g. favourite toys, and moving safely around the home. It is important to assist them during activities like these because doing so can promote independence from adults and help to develop skills. Our demonstration shows our work towards a multimodal sensing and output system that adapts the lighting conditions at home to help visually impaired children with such tasks.
Publisher's Version Article Search
Multimodal Affective Feedback: Combining Thermal, Vibrotactile, Audio and Visual Signals
Graham Wilson, Euan Freeman, and Stephen Brewster
(University of Glasgow, UK)
In this paper we describe a demonstration of our multimodal affective feedback designs, used in research to expand the emotional expressivity of interfaces. The feedback leverages inherent associations and reactions to thermal, vibrotactile, auditory and abstract visual designs to convey a range of affective states without any need for learning feedback encoding. All combinations of the different feedback channels can be utilised, depending on which combination best conveys a given state. All the signals are generated from a mobile phone augmented with thermal and vibrotactile stimulators, which will be available to conference visitors to see, touch, hear and, importantly, feel.
Publisher's Version Article Search
Niki and Julie: A Robot and Virtual Human for Studying Multimodal Social Interaction
Ron Artstein, David Traum, Jill Boberg, Alesia Gainer, Jonathan Gratch, Emmanuel Johnson, Anton Leuski, and Mikio Nakano
(University of Southern California, USA; Honda Research Institute, Japan)
We demonstrate two agents, a robot and a virtual human, which can be used for studying factors that impact social influence. The agents engage in dialogue scenarios that build familiarity, share information, and attempt to influence a human participant. The scenarios are variants of the classical “survival task,” where members of a team rank the importance of a number of items (e.g., items that might help one survive a crash in the desert). These are ranked individually and then re-ranked following a team discussion, and the difference in ranking provides an objective measure of social influence. Survival tasks have been used in psychology, virtual human research, and human-robot interaction. Our agents are operated in a “Wizard-of-Oz” fashion, where a hidden human operator chooses the agents’ dialogue actions while interacting with an experiment participant.
Publisher's Version Article Search
A Demonstration of Multimodal Debrief Generation for AUVs, Post-mission and In-mission
Helen Hastie, Xingkun Liu, and Pedro Patron
(Heriot-Watt University, UK; SeeByte, UK)
A prototype will be demonstrated that takes activity and sensor data from Autonomous Underwater Vehicles (AUVs) and automatically generates multimodal output in the form of mission reports containing natural language and visual elements. Specifically, the system takes time-series sensor data, mission logs, together with mission plans as its input, and generates descriptions of the missions in natural language, which would be verbalised by a Text-to-Speech Synthesis (TTS) engine in a multimodal system. In addition, we will demonstrate an in-mission system that provides a stream of real-time updates in natural language, thus improving situation awareness of the operator and increasing trust in the system during missions.
Publisher's Version Article Search
Laughter Detection in the Wild: Demonstrating a Tool for Mobile Social Signal Processing and Visualization
Simon Flutura, Johannes Wagner, Florian Lingenfelser, Andreas Seiderer, and Elisabeth André
(University of Augsburg, Germany)
In this demo, we present MobileSSI, a flexible software framework for Android and embedded Linux platforms, that provides developers with tools to record, analyze and recognize human behavior in real-time on mobile devices. To illustrate the benefits of the framework for the analysis of social group dynamics in naturalistic mobile settings, we present a demonstrator for laughter recognition that was implemented with MobileSSI. The demonstrator makes use of smartphones for sensing and analyzing data and employs smartwatches and tablets for visualizing the results and providing user feedback. To enable communication within the resulting ecology of mobile devices, MobileSSI includes a web socket plugin.
Publisher's Version Article Search Info

Demo Session 2

Multimodal System for Public Speaking with Real Time Feedback: A Positive Computing Perspective
Fiona Dermody and Alistair Sutherland
(Dublin City University, Ireland)
A multimodal system for public speaking with real time feedback has been developed using the Microsoft Kinect. The system has been developed within the paradigm of positive computing which focuses on designing for user wellbeing. The system detects body pose, facial expressions and voice. Visual feedback is displayed to users on their speaking performance in real time. Users can view statistics on their utilisation of speaking modalities. The system also has a mentor avatar which appears alongside the user avatar to facilitate user training. Autocue mode allows a user to practice with set text from a chosen speech.
Publisher's Version Article Search
Multimodal Biofeedback System Integrating Low-Cost Easy Sensing Devices
Wataru Hashiguchi, Junya Morita, Takatsugu Hirayama, Kenji Mase, Kazunori Yamada, and Mayu Yokoya
(Nagoya University, Japan; Shizuoka University, Japan; Panasonic, Japan)
We built a multimodal biofeedback system that integrates low-cost sensors and applied it to support meditation training by providing multimodal biofeedback to the user. Our meditation support system employs electroencephalography (EEG), heart rate variability, and eye tracking. The first two biosignals are employed to assess mental stress during meditation, and eye tracking is used to detect intervals in which the user engages in meditation by monitoring the open/closed states of the eyes.
Publisher's Version Article Search
A Telepresence System using a Flexible Textile Display
Kana Kushida and Hideyuki Nakanishi
(Osaka University, Japan)
In this study, we developed a telepresence system which has a flexible and deformable screen. The screen deforms in synchronization with the movement of objects in the projected video. This deformation of the display surface provides the perception of depth for the projected video. We attempted to add the perception of depth to the video of the remote person and extend a video conferencing system. The results of the experiment suggested that the perception of depth provided by the system strengthens the presence of a remote person and an object in the video.
Publisher's Version Article Search
Large-Scale Multimodal Movie Dialogue Corpus
Ryu Yasuhara, Masashi Inoue, Ikuya Suga, and Tetsuo Kosaka
(Yamagata University, Japan)

We present an outline of our newly created multimodal dialogue corpus that is constructed from public domain movies. Dialogues in movies are useful sources for analyzing human communication patterns. In addition, they can be used to train machine-learning-based dialogue processing systems. However, movie files are processing-intensive and contain large portions of non-dialogue segments. Therefore, we created a corpus that contains only dialogue segments from movies. The corpus contains 165,368 dialogue segments taken from 1,722 movies. These dialogues are automatically segmented by using deep neural network-based voice activity detection with filtering rules. Our corpus can reduce the human workload and machine-processing effort required to analyze human dialogue behavior using movies.

Publisher's Version Article Search Info
Immersive Virtual Reality with Multimodal Interaction and Streaming Technology
Wan-Lun Tsai, You-Lun Hsu, Chi-Po Lin, Chen-Yu Zhu, Yu-Cheng Chen, and Min-Chun Hu
(National Cheng Kung University, Taiwan)
In this demo, we present an immersive virtual reality (VR) system which integrates multimodal interaction sensors (i.e., smartphone, Kinect v2, and Myo armband) and streaming technology to improve the VR experience. The integrated system addresses two common problems in most VR systems: (1) the very limited playing area due to the transmission cable between the computer and the display/interaction devices, and (2) the non-intuitive way of controlling virtual objects. We use Unreal Engine 4 to develop an immersive VR game with 6 interactive levels to demonstrate the feasibility of our system. In the game, the user can not only walk freely within a large playing area surrounded by multiple Kinect sensors but also select virtual objects to grab and throw with the Myo armband. The experiment shows that our idea is workable for the VR experience.
Multimodal Interaction with the Autonomous Android ERICA
Divesh Lala, Pierrick Milhorat, Koji Inoue, Tianyu Zhao, and Tatsuya Kawahara
(Kyoto University, Japan)
We demonstrate an interactive conversation with an android named ERICA. In this demonstration the user can converse with ERICA on a number of topics. We demonstrate both the dialog management system and the eye gaze behavior of ERICA used for indicating attention and turn taking.
Ask Alice: An Artificial Retrieval of Information Agent
Michel Valstar, Tobias Baur, Angelo Cafaro, Alexandru Ghitulescu, Blaise Potard, Johannes Wagner, Elisabeth André, Laurent Durieu, Matthew Aylett, Soumia Dermouche, Catherine Pelachaud, Eduardo Coutinho, Björn Schuller, Yue Zhang, Dirk Heylen, Mariët Theune, and Jelte van Waterschoot
(University of Nottingham, UK; University of Augsburg, Germany; CNRS, France; Cereproc, UK; Cantoche, France; Imperial College London, UK; University of Twente, Netherlands)
We present a demonstration of the ARIA framework, a modular approach for rapid development of virtual humans for information retrieval that have linguistic, emotional, and social skills and a strong personality. We demonstrate the framework's capabilities in a scenario where `Alice in Wonderland', a popular English literature book, is embodied by a virtual human representing Alice. The user can engage in an information exchange dialogue, where Alice acts as the expert on the book, and the user as an interested novice. Besides speech recognition, sophisticated audio-visual behaviour analysis is used to inform the core agent dialogue module about the user's state and intentions, so that it can go beyond simple chat-bot dialogue. The behaviour generation module features a unique new capability of being able to deal gracefully with interruptions of the agent.
Design of Multimodal Instructional Tutoring Agents using Augmented Reality and Smart Learning Objects
Anmol Srivastava and Pradeep Yammiyavar
(IIT Guwahati, India)
This demo presents a novel technique for enriching students’ learning experience in electronic engineering laboratories and the basis for its design. The system employs mobile augmented reality (AR) and physical smart objects that can be used in conjunction to assist students in laboratories. Such systems are capable of providing just-in-time information and sensing errors made while prototyping specific electronic circuits. These systems can help reduce the cognitive load of students in laboratories and bridge the gaps between theory and practical application that students face there. Two prototypes have been developed: (i) an intelligent breadboard prototype that can sense errors such as loose wiring and wrong connections for a specific experiment, and (ii) an AR application that provides visualization and instruction for circuit assembly and operating test equipment. The intelligent breadboard acts as a smart learning object. Design methods were used to conceptualize and build such systems. The idea is to merge practices of Human Computer Interaction with those of machine learning to design highly situated, physically located tutoring systems for students. Such systems can help innovatively in teaching and learning in engineering laboratories.
AttentiveVideo: Quantifying Emotional Responses to Mobile Video Advertisements
Phuong Pham and Jingtao Wang
(University of Pittsburgh, USA)
This demo presents AttentiveVideo, a multi-modal video player that can collect and infer viewers’ emotional responses to video advertisements on unmodified smart phones. When a subsidized video advertisement is playing, AttentiveVideo uses on-lens finger gestures for tangible video control, and employs implicit photoplethysmography (PPG) sensing to infer viewers' attention, engagement, and sentimentality toward advertisements. Through a 24-participant pilot study, we found that AttentiveVideo is easy to learn and intuitive to use. More importantly, AttentiveVideo achieved good accuracies on a wide range of emotional measures (best average accuracy = 65.9%, kappa = 0.30 across 9 metrics). Our preliminary result shows the potential of both low-cost collection and deep understanding of emotional responses to mobile video advertisements.
Young Merlin: An Embodied Conversational Agent in Virtual Reality
Ivan Gris, Diego A. Rivera, Alex Rayon, Adriana Camacho, and David Novick
(Inmerssion, USA)
This paper describes a system for embodied conversational agents developed by Inmerssion and one of the applications built with this system, Young Merlin: Trial by Fire. In the Merlin application, the ECA and a human interact with speech in virtual reality. The goal of this application is to provide engaging VR experiences that build rapport through storytelling and verbal interactions. The agent is fully automated, and his attitude towards the user changes over time depending on the interaction. The conversational system was built through a declarative approach that supports animations, markup language, and gesture recognition. Future versions of Merlin will implement multi-character dialogs, additional actions, and extended interaction time.

EmotiW Challenge

EmotiW 2016: Video and Group-Level Emotion Recognition Challenges
Abhinav Dhall, Roland Goecke, Jyoti Joshi, Jesse Hoey, and Tom Gedeon
(University of Waterloo, Canada; University of Canberra, Australia; Australian National University, Australia)
This paper discusses the baseline for the Emotion Recognition in the Wild (EmotiW) 2016 challenge. Continuing on the theme of automatic affect recognition `in the wild', the EmotiW challenge 2016 consists of two sub-challenges: an audio-video based emotion recognition sub-challenge and a new group-based emotion recognition sub-challenge. The audio-video based sub-challenge is based on the Acted Facial Expressions in the Wild (AFEW) database. The group-based emotion recognition sub-challenge is based on the Happy People Images (HAPPEI) database. We describe the data, baseline method, challenge protocols and challenge results. A total of 22 and 7 teams participated in the audio-video based emotion and group-based emotion sub-challenges, respectively.
Emotion Recognition in the Wild from Videos using Images
Sarah Adel Bargal, Emad Barsoum, Cristian Canton Ferrer, and Cha Zhang
(Boston University, USA; Microsoft Research, USA)
This paper presents the implementation details of the proposed solution to the Emotion Recognition in the Wild 2016 Challenge, in the category of video-based emotion recognition. The proposed approach takes the video stream from the audio-video trimmed clips provided by the challenge as input and produces the emotion label corresponding to this video sequence. This output is encoded as one out of seven classes: the six basic emotions (Anger, Disgust, Fear, Happiness, Sad, Surprise) and Neutral. Overall, the system consists of several pipelined modules: face detection, image pre-processing, deep feature extraction, feature encoding and, finally, an SVM classification. This system achieves 59.42% validation accuracy, surpassing the competition baseline of 38.81%. With regard to test data, our system achieves 56.66% recognition rate, also improving the competition baseline of 40.47%.
A Deep Look into Group Happiness Prediction from Images
Aleksandra Cerekovic

We propose a Deep Neural Network (DNN) architecture to predict happiness displayed by a group of people in images. Our approach exploits both image context and individual facial information extracted from an image. The latter is explicitly modeled by Long Short Term Memory networks (LSTMs) encoding face happiness intensity and the spatial distribution of faces forming a group. LSTM outputs are combined with image context descriptors, to obtain the final group happiness score. We thoroughly evaluate our approach on the recently proposed HAPPEI dataset for group happiness prediction. Our results show that the proposed architecture outperforms a baseline CNN trained to predict group happiness directly from an image. Our method also shows an improvement of approximately 30% over the HAPPEI dataset’s baseline on the validation set.

Video-Based Emotion Recognition using CNN-RNN and C3D Hybrid Networks
Yin Fan, Xiangju Lu, Dian Li, and Yuanliu Liu
(iQiyi, China)
In this paper, we present a video-based emotion recognition system submitted to the EmotiW 2016 Challenge. The core module of this system is a hybrid network that combines recurrent neural network (RNN) and 3D convolutional networks (C3D) in a late-fusion fashion. RNN and C3D encode appearance and motion information in different ways. Specifically, RNN takes appearance features extracted by convolutional neural network (CNN) over individual video frames as input and encodes motion later, while C3D models appearance and motion of video simultaneously. Combined with an audio module, our system achieved a recognition accuracy of 59.02% without using any additional emotion-labeled video clips in training set, compared to 53.8% of the winner of EmotiW 2015. Extensive experiments show that combining RNN and C3D together can improve video-based emotion recognition noticeably.
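Late fusion, as mentioned above, combines the per-class scores of separately trained models rather than their internal features. A minimal sketch, assuming each model outputs one score per emotion class and that fusion is a weighted average (the abstract does not state the actual fusion rule, and the class order and weights here are illustrative assumptions):

```python
# Seven challenge classes; the ordering is an arbitrary choice for this sketch.
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

def late_fusion(score_lists, weights):
    """Weighted average of per-model class scores.

    score_lists: one list of len(EMOTIONS) scores per model.
    weights: one non-negative weight per model (hypothetical values).
    Returns (predicted label, fused scores).
    """
    if len(score_lists) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    fused = [
        sum(w * scores[k] for w, scores in zip(weights, score_lists)) / total
        for k in range(len(EMOTIONS))
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused
```

In practice the weights would typically be tuned on a validation set; an audio model's scores can be added as a third entry in `score_lists`.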
LSTM for Dynamic Emotion and Group Emotion Recognition in the Wild
Bo Sun, Qinglan Wei, Liandong Li, Qihua Xu, Jun He, and Lejun Yu
(Beijing Normal University, China)
In this paper, we describe our work in the fourth Emotion Recognition in the Wild (EmotiW 2016) Challenge. For the video based emotion recognition sub-challenge, we extract acoustic features, LBP-TOP, Dense SIFT and CNN-LSTM features to recognize the emotions of film characters. For the group level emotion recognition sub-challenge, we use LSTM and GEM models. We train linear SVM classifiers for these kinds of features on the AFEW 6.0 and HAPPEI datasets, and use a fusion network we proposed to combine all the extracted features at the decision level. Our final results are 51.54% accuracy on the AFEW testing set and 0.836 RMSE on the HAPPEI testing set.
Multi-clue Fusion for Emotion Recognition in the Wild
Jingwei Yan, Wenming Zheng, Zhen Cui, Chuangao Tang, Tong Zhang, Yuan Zong, and Ning Sun
(Southeast University, China; Nanjing University of Posts and Telecommunications, China)

In the past three years, the Emotion Recognition in the Wild (EmotiW) Grand Challenge has drawn more and more attention due to its huge potential applications. In the fourth challenge, aimed at the task of video based emotion recognition, we propose a multi-clue emotion fusion (MCEF) framework that models human emotion from three mutually complementary sources: facial appearance texture, facial action, and audio. To extract high-level emotion features from sequential face images, we employ a CNN-RNN architecture, where the face image from each frame is first fed into the fine-tuned VGG-Face network to extract face features, and then the features of all frames are sequentially traversed in a bidirectional RNN so as to capture dynamic changes of facial textures. To attain more accurate facial actions, a facial landmark trajectory model is proposed to explicitly learn emotion variations of facial components. Further, audio signals are also modeled in a CNN framework by extracting low-level energy features from segmented audio clips and then stacking them as an image-like map. Finally, we fuse the results generated from the three clues to boost the performance of emotion recognition. Our proposed MCEF achieves an overall accuracy of 56.66%, a large improvement of 16.19% with respect to the baseline.

Multi-view Common Space Learning for Emotion Recognition in the Wild
Jianlong Wu, Zhouchen Lin, and Hongbin Zha
(Peking University, China; Shanghai Jiao Tong University, China)

It is a very challenging task to recognize emotion in the wild. Recently, combining information from various views or modalities has attracted more attention. Cross-modality features and features extracted by different methods are regarded as multi-view information of the sample. In this paper, we propose a method to analyse multi-view features of emotion samples and automatically recognize the expression as part of the fourth Emotion Recognition in the Wild Challenge (EmotiW 2016). In our method, we first extract multi-view features such as BoF, CNN, LBP-TOP and audio features for each expression sample. Then we learn the corresponding projection matrices to map multi-view features into a common subspace. In the meantime, we impose ℓ2,1-norm penalties on projection matrices for feature selection. We apply both this method and PLSR to emotion recognition. We conduct experiments on both AFEW and HAPPEI datasets, and achieve superior performance. The best recognition accuracy of our method is 0.5531 on the AFEW dataset for video based emotion recognition in the wild. The minimum RMSE for group happiness intensity recognition is 0.9525 on the HAPPEI dataset. Both results are much better than the challenge baseline.
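For readers unfamiliar with the penalty mentioned above: the ℓ2,1-norm of a matrix is commonly defined as the sum of the Euclidean norms of its rows. Penalizing it tends to drive whole rows to zero, which is why it can act as feature selection when each row corresponds to one input feature. The sketch below assumes that row convention (some papers use columns instead) and is not taken from the paper.

```python
import math

def l21_norm(matrix):
    """Sum over rows of the Euclidean (l2) norm of each row."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)

def selected_rows(matrix, tol=1e-8):
    """Indices of rows whose l2 norm exceeds tol, i.e. features kept.

    The tolerance is a hypothetical cutoff for this illustration.
    """
    return [i for i, row in enumerate(matrix)
            if math.sqrt(sum(v * v for v in row)) > tol]
```

For example, a projection matrix with rows (3, 4), (0, 0) and (5, 12) has ℓ2,1-norm 5 + 0 + 13, and the all-zero second row marks a feature that the projection ignores.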

HoloNet: Towards Robust Emotion Recognition in the Wild
Anbang Yao, Dongqi Cai, Ping Hu, Shandong Wang, Liang Sha, and Yurong Chen
(Intel Labs, China; Beihang University, China)
In this paper, we present HoloNet, a well-designed Convolutional Neural Network (CNN) architecture used in our submissions to the video based sub-challenge of the Emotion Recognition in the Wild (EmotiW) 2016 challenge. In contrast to previous related methods that usually adopt relatively simple and shallow neural network architectures to address the emotion recognition task, our HoloNet has three critical considerations in network design. (1) To reduce redundant filters and enhance the non-saturated non-linearity in the lower convolutional layers, we use a modified Concatenated Rectified Linear Unit (CReLU) instead of ReLU. (2) To enjoy the accuracy gain from considerably increased network depth and maintain efficiency, we combine residual structure and CReLU to construct the middle layers. (3) To broaden network width and introduce a multi-scale feature extraction property, the top layers are designed as a variant of the inception-residual structure. The main benefit of grouping these modules into the HoloNet is that both negative and positive phase information implicitly contained in the input data can flow over it in multiple paths, so that deep multi-scale features explicitly capturing emotion variation can be extracted from multi-path sibling layers and then concatenated for robust recognition. We obtain competitive results in this year’s video based emotion recognition sub-challenge using an ensemble of two HoloNet models trained with the given data only. Specifically, we obtain a mean recognition rate of 57.84%, outperforming the baseline accuracy by an absolute margin of 17.37% and yielding a 4.04% absolute accuracy gain compared to the result of last year’s winning team. Meanwhile, our method runs at a speed of several thousand frames per second on a GPU, making it well suited to real-time scenarios.
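The standard Concatenated ReLU concatenates ReLU(x) with ReLU(-x), so both the positive and the negative phase of an activation are passed on and the channel count doubles. The sketch below shows only this standard form on a flat list of activations; the abstract does not describe how HoloNet's modified CReLU differs, so none of that is represented here.

```python
def relu(xs):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in xs]

def crelu(xs):
    """Standard CReLU: concatenate ReLU(x) with ReLU(-x).

    The output has twice as many entries as the input.
    """
    return relu(xs) + relu([-x for x in xs])
```

With plain ReLU, a negative response is discarded; with CReLU it survives in the second half of the output, which is one way fewer learned filters can cover both phases.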
Group Happiness Assessment using Geometric Features and Dataset Balancing
Vassilios Vonikakis, Yasin Yazici, Viet Dung Nguyen, and Stefan Winkler
(Advanced Digital Sciences Center at University of Illinois, Singapore; Nanyang Technological University, Singapore)
This paper presents the techniques employed in our team's submissions to the 2016 Emotion Recognition in the Wild contest, for the sub-challenge of group-level emotion recognition. The objective of this sub-challenge is to estimate the happiness intensity of groups of people in consumer photos. We follow a predominantly bottom-up approach, in which the individual happiness level of each face is estimated separately. The proposed technique is based on geometric features derived from 49 facial points. These features are used to train a model on a subset of the HAPPEI dataset, balanced across expression and head pose, using Partial Least Squares regression. The trained model exhibits competitive performance for a range of non-frontal poses, while at the same time offering a semantic interpretation of the facial distances that may contribute positively or negatively to group-level happiness. Various techniques are explored for combining these estimations in order to perform group-level prediction, including the distribution of expressions, significance of a face relative to the whole group, and mean estimation. Our best submission achieves an RMSE of 0.8316 on the competition test set, which compares favorably to the RMSE of 1.30 of the baseline.
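The group-level results in this session are reported as root mean square error (RMSE), where lower is better. As a reminder of how the metric is computed, and of the simplest of the combination strategies named above (mean estimation over per-face scores), here is a minimal sketch; the function names are illustrative and the inputs are not challenge data.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def group_score_by_mean(face_scores):
    """Mean of per-face happiness estimates as the group-level prediction."""
    if not face_scores:
        raise ValueError("at least one face score is required")
    return sum(face_scores) / len(face_scores)
```

Because errors are squared before averaging, a few badly mispredicted images raise the RMSE more than many small errors do.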
Happiness Level Prediction with Sequential Inputs via Multiple Regressions
Jianshu Li, Sujoy Roy, Jiashi Feng, and Terence Sim
(National University of Singapore, Singapore; SAP, Singapore)

This paper presents our solution submitted to the Emotion Recognition in the Wild (EmotiW 2016) group-level happiness intensity prediction sub-challenge. The objective of this sub-challenge is to predict the overall happiness level given an image of a group of people in a natural setting. We note that both the global setting and the faces of the individuals in the image influence the group-level happiness intensity of the image. Hence the challenge lies in building a solution that incorporates both these factors and also considers their right combination. Our proposed solution incorporates both these factors as a combination of global and local information. We use a convolutional neural network to extract discriminative face features, and a recurrent neural network to selectively memorize the important features to perform the group-level happiness prediction task. Experimental evaluations show promising performance improvements, resulting in Root Mean Square Error (RMSE) reduction of about 0.5 units on the test set compared to the baseline algorithm that uses only global information.

Video Emotion Recognition in the Wild Based on Fusion of Multimodal Features
Shizhe Chen, Xinrui Li, Qin Jin, Shilei Zhang, and Yong Qin
(Renmin University of China, China; IBM Research, China)

In this paper, we present our methods for the Audio-Video Based Emotion Recognition subtask in the 2016 Emotion Recognition in the Wild (EmotiW) Challenge. The task is to predict one of the seven basic emotions for the characters in the video clips extracted from movies or TV shows. In our approach, we explore various multimodal features from audio, facial image and video motion modalities. The audio features contain statistical acoustic features, MFCC Bag-of-Audio-Words and MFCC Fisher Vectors. For image related features, we extract hand-crafted features (LBP-TOP and SPM Dense SIFT) and learned features (CNN features). The improved Dense Trajectory is used as the motion related feature. We train SVM, Random Forest and Logistic Regression classifiers for each kind of feature. Among them, the MFCC Fisher Vector is the best acoustic feature and the facial CNN feature is the most discriminative feature for emotion recognition. We utilize late fusion to combine different modality features and achieve 50.76% accuracy on the testing set, which significantly outperforms the baseline test accuracy of 40.47%.

Wild Wild Emotion: A Multimodal Ensemble Approach
John Gideon, Biqiao Zhang, Zakaria Aldeneh, Yelin Kim, Soheil Khorram, Duc Le, and Emily Mower Provost
(University of Michigan, USA; SUNY Albany, USA)
Automatic emotion recognition from audio-visual data is a topic that has been broadly explored using data captured in the laboratory. However, these data are not necessarily representative of how emotion is manifested in the real world. In this paper, we describe our system for the 2016 Emotion Recognition in the Wild challenge. We use the Acted Facial Expressions in the Wild database 6.0 (AFEW 6.0), which contains short clips of popular TV shows and movies and has more variability in the data compared to laboratory recordings. We explore a set of features that incorporate information from facial expressions and speech, in addition to cues from the background music and overall scene. In particular, we propose the use of a feature set composed of dimensional emotion estimates trained from outside acoustic corpora. We design sets of multiclass and pairwise (one-versus-one) classifiers and fuse the resulting systems. Our fusion increases the performance from a baseline of 38.81% to 43.86% and from 40.47% to 46.88%, for validation and test sets, respectively. While the video features perform better than audio features alone, a combination of the two modalities achieves the greatest performance, with gains of 4.4% and 1.4%, with and without information gain, respectively. Because of the flexible design of the fusion, it is easily adaptable to other multimodal learning problems.
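A pairwise (one-versus-one) scheme trains one binary classifier per pair of classes and lets them vote. The sketch below shows only the voting step, with the binary classifiers passed in as plain functions; the data structure, names, and tie-breaking rule are assumptions for illustration, not the system described above.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(sample, classes, pairwise):
    """Predict a class by majority vote over pairwise classifiers.

    pairwise maps each (class_a, class_b) pair, in the order produced by
    itertools.combinations(classes, 2), to a function that takes the
    sample and returns either class_a or class_b.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](sample)] += 1
    # Ties resolve to the class listed first in `classes`.
    return max(classes, key=lambda c: votes[c])
```

For seven emotion classes this requires 21 binary classifiers, each of which only has to separate two emotions.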
Audio and Face Video Emotion Recognition in the Wild using Deep Neural Networks and Small Datasets
Wan Ding, Mingyu Xu, Dongyan Huang, Weisi Lin, Minghui Dong, Xinguo Yu, and Haizhou Li
(Central China Normal University, China; University of British Columbia, Canada; A*STAR, Singapore; Nanyang Technological University, Singapore; National University of Singapore, Singapore)
This paper presents the techniques used in our contribution to Emotion Recognition in the Wild 2016’s video based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear and disgust) and neutral. Compared to earlier years’ movie based datasets, this year’s test dataset introduced reality TV videos containing more spontaneous emotion. Our proposed solution is the fusion of facial expression recognition and audio emotion recognition subsystems at score level. For facial emotion recognition, starting from a network pre-trained on ImageNet training data, a deep Convolutional Neural Network is fine-tuned on FER2013 training data for feature extraction. Three classifiers (kernel SVM, logistic regression and partial least squares) are studied for comparison. An optimal fusion of classifiers learned from different kernels is carried out at the score level to improve system performance. For audio emotion recognition, a deep Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) is trained directly using the challenge dataset. Experimental results show that both subsystems individually and as a whole can achieve state-of-the-art performance. The overall accuracy of the proposed approach on the challenge test dataset is 53.9%, which is better than the challenge baseline of 40.47%.
Automatic Emotion Recognition in the Wild using an Ensemble of Static and Dynamic Representations
Mostafa Mehdipour Ghazi and Hazım Kemal Ekenel
(Sabanci University, Turkey; Istanbul Technical University, Turkey)
Automatic emotion recognition in in-the-wild video datasets is a very challenging problem because of the inter-class similarities among different facial expressions and large intra-class variabilities due to the significant changes in illumination, pose, scene, and expression. In this paper, we present our proposed method for video-based emotion recognition in the EmotiW 2016 challenge. The task considers the unconstrained emotion recognition problem by training on short video clips extracted from movies and testing on short movie clips and spontaneous video clips from reality TV data. Four different methods are employed to extract both static and dynamic emotion representations from the videos. First, local binary patterns on three orthogonal planes are used to describe spatiotemporal features of the video frames. Second, principal component analysis is applied to the image patches in a two-stage convolutional network to learn weights and extract facial features from the aligned faces. Third, the deep convolutional neural network model of VGG-Face is deployed to extract deep facial representations from aligned faces. Fourth, a bag of visual words is computed based on dense scale-invariant feature transform descriptors from aligned face images to form hand-crafted representations. Support vector machines are then utilized to train and classify the obtained spatiotemporal representations and facial features. Finally, score-level fusion is applied to combine the classification results and predict the emotion labels of the video clips. The results show that the proposed combined method outperforms each of the individual techniques, with overall validation and test accuracies of 43.13% and 40.13%, respectively. The system is a relatively good classifier for the Happy and Angry emotion categories but is unsuccessful in detecting Surprise, Disgust, and Fear.

Doctoral Consortium

The Influence of Appearance and Interaction Strategy of a Social Robot on the Feeling of Uncanniness in Humans
Maike Paetzel
(Uppsala University, Sweden)
Most research on the uncanny valley effect is concerned with the influence of unimodal visual cues in the dimensions of human-likeness and realism as a trigger of an uncanny feeling in humans. As a result, there has been little investigation into how multimodality affects the feeling of uncanniness. In our research project, we use the back-projected robot head Furhat to study the influence of multimodal cues in facial texture, expressions, voice and behaviour to broaden the understanding of the underlying cause of the uncanny valley effect. To date, we have mainly investigated the general perception of uncanniness in a back-projected head, with a special focus on multimodal gender cues. In the upcoming years, the focus will shift towards interaction strategies of social robots and their interplay with the robot's appearance.
Viewing Support System for Multi-view Videos
Xueting Wang
(Nagoya University, Japan)
Multi-view videos taken by multiple cameras from different angles are expected to be useful in a wide range of applications, such as web lecture broadcasting and concert and sports viewing. These videos can enhance the viewing experience according to users' personal preferences by means of virtual camera switching and controllable viewing interfaces. However, the increasing number of cameras makes suitable viewpoint selection a burden even for experts. Thus, my doctoral research goal is to construct a system providing convenient and high-quality viewing support for personal multi-view video viewing. We intend to include three parts: automatic viewpoint sequence recommendation, multimodal user feedback analysis, and on-line recommendation updating. Prior work focused on automatic viewpoint sequence recommendation considering contextual information and user preference. We proposed a context-dependent recommendation model and improved it by considering spatio-temporal contextual information. Further work will concentrate on analyzing multimodal user feedback while users view recommendations, in order to detect when they are unsatisfied and to model user preferences for viewpoint switching. The switching records and multimodal feedback can be used for on-line recommendation updating to improve the personal viewing support.
Engaging Children with Autism in a Shape Perception Task using a Haptic Force Feedback Interface
Alix Pérusseau-Lambert
(CEA LIST, France)
Atypical sensori-motor reactions and social interaction difficulties are commonly reported in Autism Spectrum Disorders (ASD). In our work, we consider the possible relationship between using our sense of touch to discover our environment and the development of interaction competencies, in the particular case of ASD. For this purpose we will use a shape perception task and a haptic feedback device. The design of the haptic device will be based on the state of the art and advice from experts in ASD. We plan to work on the needs and skills of children on the spectrum in the lower range of intelligence scores. This article details the development of our PhD project, which aims at designing efficient and reliable haptic force feedback interfaces and interactions for users with ASD. Preliminary results on the development of our interface and the design of the interaction tasks are presented, as well as our specific contributions.
Modeling User's Decision Process through Gaze Behavior
Kei Shimonishi
(Kyoto University, Japan)
When we choose items among alternatives, we sometimes face a mismatch between what we actually want and the items we select. Therefore, if an interactive system can probe our interests from several modalities (e.g., eye movements and speech recognition) and reduce these mismatches, the system can help us make decisions we are satisfied with. In order to build such interactive decision support systems, the systems need to estimate both users' interests (selection criteria for that decision) and users' knowledge about the content domain. Here, not only users' knowledge but also users' selection criteria can change; for example, users' selection criteria may converge in reaction to the system's recommendations. Therefore, the system needs to understand the dynamics of users' selection criteria in order to choose appropriate actions. What makes this more difficult is that the dynamics of users' internal states can themselves change depending on the phase of decision making. Therefore, we need to address (a) how to represent users' internal states, (b) how to estimate and trace temporal changes of users' internal states, and (c) how to trace users' decision phase so that the system can decide on actions for decision assistance. To tackle these problems, we propose a novel representation of users' internal state, which consists of selection criteria and structures of users' knowledge about the content domain, and a method to estimate these selection criteria from gaze behavior. In addition, we consider the multiscale dynamics of users' internal states so that the system can trace users' decision phase as temporal changes in the dynamics of users' internal states.
Multimodal Positive Computing System for Public Speaking with Real-Time Feedback
Fiona Dermody
(Dublin City University, Ireland)
A multimodal system with real-time feedback for public speaking has been developed. The system has been developed within the paradigm of positive computing which focuses on designing for user wellbeing. To date we have focused on the following determinants of wellbeing – autonomy, self-awareness and stress reduction. Two system prototypes have been developed which differ in the way they display feedback to users. One prototype displays peripheral feedback and the other displays line-of-vision feedback. Initial user evaluation of the system prototypes has yielded positive results. All users reported that they preferred the prototype with line-of-vision feedback. Users reported having autonomy in choosing what visual feedback to focus on when using the system. They also reported that they gained self-awareness as a speaker from using the system.
Prediction/Assessment of Communication Skill using Multimodal Cues in Social Interactions
Sowmya Rasipuram
(IIIT Bangalore, India)
Understanding people’s behavior in social interactions is a very interesting problem in Social Computing. In this work, we automatically predict the communication skill of a person in various kinds of social interactions. We consider in particular 1) interview-based interactions: asynchronous interviews (web-based interviews) vs. synchronous interviews (regular face-to-face interviews), and 2) non-interview-based interactions: dyad and triad conversations (group discussions). We automatically extract multimodal cues related to the verbal and non-verbal behavior content of the interaction. First, in interview-based interactions, we consider the previously uninvestigated scenario of comparing the participant’s behavioral and perceptual changes in the two contexts. Second, we address different manifestations of communication skill in different settings (face-to-face interaction vs. group). Third, the non-interview-based interactions also allow us to answer research questions such as the relation between a good communicator and other group variables like dominance or leadership. Finally, we look at several attributes (manually annotated) and features/feature groups (automatically extracted) that predict communication skill well in all settings.
Player/Avatar Body Relations in Multimodal Augmented Reality Games
Nina Rosa
(Utrecht University, Netherlands)
Augmented reality research is finally moving towards multimodal experiences: more and more applications include not only visuals, but also audio and even haptics. The purpose of multimodality in these applications can be to increase realism or to increase the amount or quality of communicated information. One particularly interesting and increasingly important application area is AR gaming, where the player can experience the virtual game integrated into the real environment and interact with it in a multimodal fashion. Currently, many games are set up such that the interaction is local (direct); however, there are many cases in which remote (indirect) interaction will be useful or even necessary. In the latter case, the actions can be expressed through a virtual avatar, while the player's real body is also still perceivably present. The player then controls the motions and actions of the avatar, and receives multimodal feedback associated with the events occurring in the game. Can it be that the player starts to perceive the avatar as (a part of) him- or herself? Or does something even more intense take place? What are the benefits of this experience? The core of this research is to understand how multimodal perceptual configuration plays a role in the relation between a player and their in-game avatar.
Computational Model for Interpersonal Attitude Expression
Soumia Dermouche
(CNRS, France; Telecom ParisTech, France)
This paper presents a plan towards a computational model of interpersonal attitudes and its integration in an embodied conversational agent (ECA). The goal is to endow an ECA with the capacity to express different interpersonal attitudes depending on the interaction context. Interpersonal attitudes can be represented by sequences of non-verbal behaviors. In our work, we rely on temporal sequence mining algorithms to extract, from a multimodal corpus, a set of temporal patterns representing interpersonal attitudes. Specifically, we propose a new temporal sequence mining algorithm called HCApriori and we evaluate it against four state-of-the-art algorithms. Results show a significant improvement of HCApriori over the other algorithms in terms of both pattern extraction accuracy and running time. The next step is to implement the temporal patterns extracted with HCApriori on an ECA.
Assessing Symptoms of Excessive SNS Usage Based on User Behavior and Emotion
Ploypailin Intapong, Tipporn Laohakangvalvit, Tiranee Achalakul, and Michiko Ohkura
(Shibaura Institute of Technology, Japan; King Mongkut’s University of Technology Thonburi, Thailand)
The worldwide use of social networking sites (SNSs) continues to increase dramatically. People are spending unexpected and unprecedented amounts of time online. However, many studies have issued warnings about the negative consequences of excessive SNS usage, including the risk of addictive behavior. This research is conducted to detect the symptoms of excessive SNS use by studying user behaviors and emotions in SNSs. We employed questionnaires, SNS APIs, and biological signals as methods. The data obtained from the study will characterize SNS usage to detect excessive use. Finally, the analytic results will be applied to developing prevention strategies that increase awareness of the risks of excessive SNS usage.
Kawaii Feeling Estimation by Product Attributes and Biological Signals
Tipporn Laohakangvalvit, Tiranee Achalakul, and Michiko Ohkura
(Shibaura Institute of Technology, Japan; King Mongkut’s University of Technology Thonburi, Thailand)
Kansei values are critical factors in manufacturing in Japan. One kansei value, kawaii, a positive adjective with connotations such as cute, lovable, and charming, is becoming more important. Our research systematically studies kawaii feelings by eye tracking and biological signals. We will use our results to construct a mathematical model of kawaii feelings that can be applied to the future design and development of kawaii products.
Multimodal Sensing of Affect Intensity
Shalini Bhatia
(University of Canberra, Australia)

Most research on affect intensity has relied on the self-report Affect Intensity Measure (AIM), which asks respondents to rate how often they react to situations with strong emotions. The AIM gives an indication of how strongly or weakly individuals tend to experience emotions in their everyday life. In this PhD project, I plan to quantify affect intensity on a continuous scale using multiple modalities of video and audio on real-world, clinically validated depression datasets. Most of the work in this area treats the problem as a binary classification problem, mainly due to the lack of dimensional data. As the depression severity of a subject increases, as seen in the case of melancholia, the facial movements become very subtle. In order to quantify depression in general, and subtypes such as melancholia in particular, we need to reveal these subtle changes. To do this, I propose to use video magnification approaches. Inspired by the success of deep learning in video classification, I plan on using deep learning models, such as Convolutional Neural Networks and Long Short-Term Memory networks, for information fusion over multiple modalities. Using the common approach to video classification, i.e. local feature extraction, a fixed-size video-level description, and training a classifier on the resulting bag-of-words representation, I present preliminary results on the classification of melancholic and non-melancholic depressed subjects and healthy controls, which will serve as a baseline for future development in depression classification and analysis. I have also compared the sensitivity and specificity of classification in depression subtypes.

Enriching Student Learning Experience using Augmented Reality and Smart Learning Objects
Anmol Srivastava
(IIT Guwahati, India)
Physical laboratories in the Electronic Engineering curriculum play a crucial role in enabling students to gain "hands-on" learning experience and get a feel for problem-solving. However, students often feel frustrated in these laboratories due to procedural difficulties and disconnects that exist between theory and practice. This impedes their learning and causes them to lose interest in the practical experiment. This research considers a ubiquitous computing approach to address this issue by embedding computational capabilities into commonly used physical objects in the electronics lab (e.g. the breadboard) and making use of a mobile Augmented Reality application to assist students. Two working prototypes have been proposed as a proof of concept: (i) an AR-based lab manual and circuit building application, and (ii) an Intelligent Breadboard, which is capable of sensing errors made by students. It is posited that such systems can help reduce cognitive load and bridge the gaps between theory and practical applications that students face in laboratories.
Automated Recognition of Facial Expressions Authenticity
Krystian Radlak and Bogdan Smolka
(Silesian University of Technology, Poland)
Recognizing the authenticity of facial expressions is quite difficult for humans. It is therefore an interesting topic for the computer vision community, as algorithms developed for estimating facial expression authenticity may be used as indicators of deception. This paper discusses the state-of-the-art methods developed for smile veracity estimation and proposes a plan for the development and validation of a novel approach to automated discrimination between genuine and posed facial expressions. The proposed fully automated technique extends high-dimensional Local Binary Patterns (LBP) to the spatio-temporal domain and combines them with the dynamics of facial landmark movements. The proposed technique will be validated on several existing smile databases and a novel database created with the use of a high-speed camera. Finally, the developed framework will be applied to the detection of deception in real-life scenarios.
Improving the Generalizability of Emotion Recognition Systems: Towards Emotion Recognition in the Wild
Biqiao Zhang
(University of Michigan, USA)
Emotion recognition in the wild requires the ability to adapt to complex and changeable application scenarios, which necessitates the generalizability of automatic emotion recognition systems. My PhD thesis focuses on methods to address factors that negatively impact the generalizability of automatic emotion recognition systems, such as the ambiguity in emotion labels, the effects of expression style (e.g., speech and music), variation in recording environments, and individual differences. In particular, I propose to tease apart the influence of these factors from emotion using multi-task learning for both feature learning and emotion inference. Results from my completed work have demonstrated that classifiers that take the influence of corpus (simulating environmental differences), expression style, and gender of speaker into consideration generalize better across corpora than those that do not.


Grand Challenge Summary

Emotion Recognition in the Wild Challenge 2016
Abhinav Dhall, Roland Goecke, Jyoti Joshi, and Tom Gedeon
(University of Waterloo, Canada; University of Canberra, Australia; Australian National University, Australia)
The fourth Emotion Recognition in the Wild (EmotiW) challenge is a grand challenge in the ACM International Conference on Multimodal Interaction 2016, Tokyo. EmotiW is a series of benchmarking and competition efforts for researchers working in the area of automatic emotion recognition in the wild. The fourth EmotiW has two sub-challenges: Video-based emotion recognition (VReco) and Group-level emotion recognition (GReco). The VReco sub-challenge is being run for the fourth time, and GReco is a new sub-challenge this year.

Workshop Summaries

1st International Workshop on Embodied Interaction with Smart Environments (Workshop Summary)
Patrick Holthaus, Thomas Hermann, Sebastian Wrede, Sven Wachsmuth, and Britta Wrede
(Bielefeld University, Germany)
The first workshop on embodied interaction with smart environments aims to bring together the very active community of multimodal interaction research and the rapidly evolving field of smart home technologies. Besides addressing the software architecture of such very complex systems, it puts an emphasis on questions regarding intuitive interaction with the environment. In particular, the role of agency leads to interesting challenges in the light of user interactions. We therefore encourage a lively discussion on the design and concepts of social robots and virtual avatars as well as innovative ambient devices and their implementation into smart environments.
ASSP4MI2016: 2nd International Workshop on Advancements in Social Signal Processing for Multimodal Interaction (Workshop Summary)
Khiet P. Truong, Dirk Heylen, Toyoaki Nishida, and Mohamed Chetouani
(University of Twente, Netherlands; Kyoto University, Japan; UPMC, France)
This paper gives a summary of the 2nd International Workshop on Advancements in Social Signal Processing for Multimodal Interaction (ASSP4MI). Following our successful 1st International Workshop on Advancements in Social Signal Processing for Multimodal Interaction, held during ICMI-2015, we proposed the 2nd ASSP4MI workshop during ICMI-2016. The topics addressed and discussions fostered during last year's workshop are considered very relevant and alive in the research community. In this year's workshop, we continued addressing important topics and fostering fruitful discussions among researchers from different disciplines working in the fields of Social Signal Processing (SSP) and multimodal interaction.
ERM4CT 2016: 2nd International Workshop on Emotion Representations and Modelling for Companion Systems (Workshop Summary)
Kim Hartmann, Ingo Siegert, Ali Albert Salah, and Khiet P. Truong
(University of Magdeburg, Germany; Bogazici University, Turkey; University of Twente, Netherlands)
In this paper the organisers present a brief overview of the 2nd International Workshop on Emotion Representations and Modelling for Companion Systems (ERM4CT). The ERM4CT 2016 Workshop is held in conjunction with the 18th ACM International Conference on Multimodal Interaction (ICMI 2016) taking place in Tokyo, Japan. The ERM4CT is the follow-up of three previous workshops on emotion modelling for affective human-computer interaction and companion systems. Apart from its usual focus on emotion representations and models, this year's ERM4CT puts special emphasis on how to model adequate affective system behaviour. For the first time, this year's ERM4CT gave out a dataset, which all attendees could investigate to jointly discuss their findings.
International Workshop on Multimodal Virtual and Augmented Reality (Workshop Summary)
Wolfgang Hürst, Daisuke Iwai, and Prabhakaran Balakrishnan
(Utrecht University, Netherlands; Osaka University, Japan; University of Texas at Dallas, USA)
Virtual reality (VR) and augmented reality (AR) are expected by many to become the next wave of computing with significant impacts on our daily lives. Motivated by this, we organized a workshop on “Multimodal Virtual and Augmented Reality (MVAR)” at the 18th ACM International Conference on Multimodal Interaction (ICMI 2016). While current VR and AR installations mostly focus on the visual domain, we expect multimodality to play a crucial role in future, next generation VR/AR systems. The submissions for this workshop reflect the potential of multimodality for VR and AR, illustrate interesting new directions, and pinpoint important issues. This paper gives a short motivation for the workshop and its aim, and summarizes the aforementioned trends and challenges identified from the submissions.
International Workshop on Social Learning and Multimodal Interaction for Designing Artificial Agents (Workshop Summary)
Mohamed Chetouani, Salvatore M. Anzalone, Giovanna Varni, Isabelle Hupont Torres, Ginevra Castellano, Angelica Lim, and Gentiane Venture
(UPMC, France; Paris 8 University, France; Uppsala University, Sweden; SoftBank Robotics, France; Tokyo University of Agriculture and Technology, Japan)
The “social learning and multimodal interaction for designing artificial agents” workshop aims at presenting scientific and philosophical advances related to social learning and multimodal interaction for enhancing the design of artificial agents. Papers presented in the workshop include studies on human behavior modeling, on social robotics, and on virtual agents. Our two invited speakers, Prof. Catherine Pelachaud and Prof. Louis-Philippe Morency, will enrich the workshop and open the door to further discussion by bringing their widely acknowledged expertise in the field.
1st International Workshop on Multi-sensorial Approaches to Human-Food Interaction (Workshop Summary)
Anton Nijholt, Carlos Velasco, Kasun Karunanayaka, and Gijs Huisman
(University of Twente, Netherlands; BI Norwegian Business School, Norway; University of Oxford, UK; Imagineering Institute, Malaysia)
This is an introductory paper for the workshop entitled ‘Multi-Sensorial Approaches to Human-Food Interaction’ held at ICMI 2016, which took place on the 16th of November, 2016 in Tokyo, Japan. Here we discuss our objectives and the relevance of the workshop, and summarize the key contributions of the position papers. We were able to gather a group of researchers from different countries in Europe and Asia who presented their research and discussed the current developments, trends, limitations, and future applications of the field. Whilst this is the first workshop of its kind, we anticipate that the field of multisensory Human-Food Interaction (HFI) will grow in the upcoming years in terms of research and development, and that its products will impact our everyday eating experiences.
International Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction (Workshop Summary)
Ronald Böck, Francesca Bonin, Nick Campbell, and Ronald Poppe
(University of Magdeburg, Germany; IBM Research, Ireland; Trinity College Dublin, Ireland; Utrecht University, Netherlands)
This paper gives a brief overview of the third workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction. The paper focuses on the main aspects intended to be discussed in the workshop, reflecting the main scope of the papers presented during the meeting. The MA3HMI 2016 workshop is held in conjunction with the 18th ACM International Conference on Multimodal Interaction (ICMI 2016) taking place in Tokyo, Japan, in November 2016. This year, we have solicited papers concerning the different phases of the development of multimodal systems. Tools and systems that address real-time conversations with artificial agents and technical systems are also within the scope.
