ICMI 2017
19th ACM International Conference on Multimodal Interaction (ICMI 2017)

19th ACM International Conference on Multimodal Interaction (ICMI 2017), November 13–17, 2017, Glasgow, UK

ICMI 2017 – Proceedings



Title Page

Message from the Chairs
Welcome to Glasgow and to the 19th ACM International Conference on Multimodal Interaction, ICMI 2017. Over the years, the conference has become the premier venue for all researchers interested in human-human and human-machine multimodal interaction. The contributions that will be presented during the conference - not only papers, but also demonstrations and tutorials - will cover such a highly interdisciplinary domain in all its aspects, from theoretic foundations to empirical and experimental studies, from sensors to interfaces, from low-level signal processing to high-level semantic interpretation of data, and from the design of component technologies to the development of fully integrated systems. The special topic of interest of ICMI 2017 is Human-Computer Interaction, in recognition not only of the importance that the area has in the conference, but also of the vibrant research community that works on the topic in Scotland.
ICMI 2017 Organization
Supporters and Sponsors

Invited Talks

Gastrophysics: Using Technology to Enhance the Experience of Food and Drink (Keynote)
Charles Spence
(University of Oxford, UK)
Currently, technology mostly distracts us from what we are eating and drinking. As hand-held devices continue to evolve, food and drink will increasingly have to fight with our smart phones, rather than the TV, for our attention. Some chefs have already responded to the challenge by trying to ban technology at the dinner table. However, I am optimistic that, in the years to come, technology will rather become integral to our food and drink experiences: Everything from using your tablet as 21st century plateware (now that they are dishwasher safe), through to using hand-held technologies to provide a dash of sonic (digital) seasoning – that is, providing the right sonic backdrop (be it music or soundscapes) matched to bring out the best in whatever we happen to be eating or drinking. In this talk, I will highlight the ways in which technology will, and will not, change our experience of food and drink in the years to come. I will give examples from modernist chefs and molecular mixologists that are already starting to transform our mainstream experience – be it in the air or in the home environment.
Collaborative Robots: From Action and Interaction to Collaboration (Keynote)
Danica Kragic
(KTH, Sweden)
The integral ability of any robot is to act in the environment and to interact and collaborate with people and other robots. Interaction between two agents builds on the ability to engage in mutual prediction and signaling. Thus, human-robot interaction requires a system that can interpret and make use of human signaling strategies in a social context. Our work in this area focuses on developing a framework for human motion prediction in the context of joint action in HRI. We base this framework on the idea that social interaction is highly influenced by sensorimotor contingencies (SMCs). Instead of constructing explicit cognitive models, we rely on the interaction between actions and the perceptual change that they induce in both the human and the robot. This approach allows us to employ a single model for motion prediction and goal inference and to seamlessly integrate the human actions into the environment and task context. The current trend in computer vision is the development of data-driven approaches where the use of large amounts of data tries to compensate for the complexity of the world captured by cameras. Are these approaches also viable solutions in robotics? Apart from 'seeing', a robot is capable of acting, thus purposively changing what and how it sees the world around it. There is a need for an interplay between processes such as attention, segmentation, object detection, recognition and categorization in order to interact with the environment. In addition, the parameterization of these is inevitably guided by the task or the goal a robot is supposed to achieve. In this talk, I will present the current state of the art in the area of robot vision and discuss open problems in the area. I will also show how visual input can be integrated with proprioception, tactile and force-torque feedback in order to plan, guide and assess the robot's action and interaction with the environment.
We employ a deep generative model that makes inferences over future human motion trajectories given the intention of the human and the history as well as the task setting of the interaction. With the help of predictions drawn from the model, we can determine the most likely future motion trajectory and make inferences over intentions and objects of interest.
Situated Conceptualization: A Framework for Multimodal Interaction (Keynote)
Lawrence Barsalou
(University of Glasgow, UK)
One way of construing brain organization is as a collection of neural systems that processes the components of a situation in parallel, including its setting, agents, objects, self-relevance, internal states, actions, outcomes, etc. In a given situation, each situational component is conceptualized individually, as when components of eating in a kitchen are conceptualized as kitchen (setting), diner (agent), food (food), hunger (internal state), and chewing (action). In turn, global concepts integrate these individual conceptualizations into larger structures that conceptualize the situation as a whole, such as eating and meal. From this perspective, a situated conceptualization is a distributed record of conceptual processing in a given situation, across all the relevant component systems each distributed throughout the brain. On later occasions, when cued by something in the external or internal environment, a situated conceptualization becomes active to simulate the respective situation in its absence, producing multimodal pattern-completion inferences that guide situated action (e.g., activating a situated conceptualization to simulate and control eating). From this perspective, the concept that represents a category, such as kitchen or eating, is the collection of situated conceptualizations that has accumulated from processing the category across situations, similar to exemplar theories. The utility of situated conceptualization as a general theoretical construct is illustrated for situated action, social priming, social mirroring, emotion, and appetitive behaviors, as well as for habits and individual differences.
Steps towards Collaborative Multimodal Dialogue (Sustained Contribution Award)
Phil Cohen
(Voicebox Technologies, USA)
This talk will discuss progress in building collaborative multimodal systems, both systems that offer a collaborative interface that augments human performance, and autonomous systems with which one can collaborate. To begin, I discuss what we will mean by collaboration, which revolves around plan recognition skills learned as a child. Then, I present a collaborative multimodal operations planning system, Sketch-Thru-Plan, that enables users to interact multimodally with speech and pen as it attempts to infer their plans. The system offers suggested actions and allows the user to confirm/disconfirm those suggestions. I show how the collaborative multimodal interface enables more rapid task performance and higher user satisfaction than existing deployed GUIs built for the same task. In the second part of the talk, I discuss the differences for system design between building such a collaborative multimodal interface and building an autonomous agent with which one can collaborate through multimodal dialogue. I argue that interacting with an autonomous agent (e.g., a robot or virtual assistant) may require a more declarative approach to supporting collaborative communication. People’s deeply engrained collaboration strategies will be seen to be at the foundation of dialogue and are expected by human interlocutors. The approach I will advocate to implementing such a strategy is to build a belief-desire-intention (BDI) architecture that attempts to recognize the collaborator’s plans, and determine obstacles to their success. The system then plans and executes a response to overcome those obstacles, which results in the system’s planning appropriate actions (including speech acts). I will illustrate and demonstrate a system that embodies this type of collaboration, engaging users in dialogue about travel planning. Finally, I will compare this approach with current academic and research approaches to dialogue.

Main Track

Oral Session 1: Children and Interaction

Tablets, Tabletops, and Smartphones: Cross-Platform Comparisons of Children’s Touchscreen Interactions
Julia Woodward, Alex Shaw, Aishat Aloba, Ayushi Jain, Jaime Ruiz, and Lisa Anthony
(University of Florida, USA)
The proliferation of smartphones and tablets has increased children’s access to and usage of touchscreen devices. Prior work on smartphones has shown that children’s touch interactions differ from adults’. However, larger screen devices like tablets and tabletops have not been studied at the same granularity for children as smaller devices. We present two studies: one of 13 children using tablets with pen and touch, and one of 18 children using a touchscreen tabletop device. Participants completed target touching and gesture drawing tasks. We found significant differences in performance by modality for tablet: children responded faster and slipped less with touch than pen. In the tabletop study, children responded more accurately to changing target locations (fewer holdovers), and were more accurate touching targets around the screen. Gesture recognition rates were consistent across devices. We provide design guidelines for children’s touchscreen interactions across screen sizes to inform the design of future touchscreen applications for children.
Toward an Efficient Body Expression Recognition Based on the Synthesis of a Neutral Movement
Arthur Crenn, Alexandre Meyer, Rizwan Ahmed Khan, Hubert Konik, and Saida Bouakaz
(University of Lyon, France; University of Saint-Etienne, France)
We present a novel framework for the recognition of body expressions using human postures. The proposed system is based on analyzing the spectral difference between an expressive and a neutral animation. The second problem addressed in this paper is the formalization of neutral animation. Formalization of neutral animation has not been tackled before and it can be very useful for the synthesis of animation, the recognition of expressions, etc. In this article, we propose a cost function to synthesize a neutral motion from an expressive motion. The cost function formalizes a neutral motion by computing the distance and combining it with the acceleration of each body joint during a motion. We have evaluated our approach on several databases with heterogeneous movements and body expressions. Our body expression recognition results exceed the state of the art on the evaluated databases.
Interactive Narration with a Child: Impact of Prosody and Facial Expressions
Ovidiu Șerban, Mukesh Barange, Sahba Zojaji, Alexandre Pauchet, Adeline Richard, and Emilie Chanoni
(Normandy University, France; University of Rouen, France)
Intelligent Virtual Agents are suitable means for interactive storytelling for children. The engagement level of child interaction with virtual agents is a challenging issue in this area. However, the characteristics of child-agent interaction have received little to moderate attention in scientific studies, whereas such knowledge may be crucial for designing specific applications. This article proposes a Wizard of Oz platform for interactive narration. An experimental study in the context of interactive storytelling exploiting this platform is presented to evaluate the impact of agent prosody and facial expressions on child participation during storytelling. The results show that the use of the virtual agent with prosody and facial expression modalities improves the engagement of children in interaction during the narrative sessions.
Comparing Human and Machine Recognition of Children’s Touchscreen Stroke Gestures
Alex Shaw, Jaime Ruiz, and Lisa Anthony
(University of Florida, USA)
Children's touchscreen stroke gestures are poorly recognized by existing recognition algorithms, especially compared to adults' gestures. It seems clear that improved recognition is necessary, but how much is realistic? Human recognition rates may be a good starting point, but no prior work exists establishing an empirical threshold for a target accuracy in recognizing children's gestures based on human recognition. To this end, we present a crowdsourcing study in which naïve adult viewers recruited via Amazon Mechanical Turk were asked to classify gestures produced by 5- to 10-year-old children. We found a significant difference between human (90.60%) and machine (84.14%) recognition accuracy, over all ages. We also found significant differences between human and machine recognition of gestures of different types: humans perform much better than machines do on letters and numbers versus symbols and shapes. We provide an empirical measure of the accuracy that future machine recognition should aim for, as well as a guide for which categories of gestures have the most room for improvement in automated recognition. Our findings will inform future work on recognition of children's gestures and improving applications for children.

Oral Session 2: Understanding Human Behaviour

Virtual Debate Coach Design: Assessing Multimodal Argumentation Performance
Volha Petukhova, Tobias Mayer, Andrei Malchanau, and Harry Bunt
(Saarland University, Germany; Tilburg University, Netherlands)
This paper discusses the design and evaluation of a coaching system used to train young politicians to apply appropriate multimodal rhetoric devices to improve their debate skills. The presented study is carried out to develop debate performance assessment methods and interaction models underlying a Virtual Debate Coach (VDC) application. We identify a number of criteria associated with three questions: (1) how convincing is a debater's argumentation; (2) how well are debate arguments structured; and (3) how well is an argument delivered. We collected and analysed multimodal data of trainees' debate behaviour, and contrasted it with that of skilled professional debaters. Observational, correlation and machine learning experiments were performed to identify multimodal correlates of convincing debate performance and link them to experts' assessments. A rich set of prosodic, motion, linguistic and structural features was considered for the system to operate on. The VDC system was positively evaluated in a trainee-based setting.
Predicting the Distribution of Emotion Perception: Capturing Inter-rater Variability
Biqiao Zhang, Georg Essl, and Emily Mower Provost
(University of Michigan, USA; University of Wisconsin-Milwaukee, USA)
Emotion perception is person-dependent and variable. Dimensional characterizations of emotion can capture this variability by describing emotion in terms of its properties (e.g., valence, positive vs. negative, and activation, calm vs. excited). However, in many emotion recognition systems, this variability is often considered "noise" and is attenuated by averaging across raters. Yet, inter-rater variability provides information about the subtlety or clarity of an emotional expression and can be used to describe complex emotions. In this paper, we investigate methods that can effectively capture the variability across evaluators by predicting emotion perception as a discrete probability distribution in the valence-activation space. We propose: (1) a label processing method that can generate two-dimensional discrete probability distributions of emotion from a limited number of ordinal labels; (2) a new approach that predicts the generated probabilistic distributions using dynamic audio-visual features and Convolutional Neural Networks (CNNs). Our experimental results on the MSP-IMPROV corpus suggest that the proposed approach is more effective than the conventional Support Vector Regression (SVR) approach with utterance-level statistical features, and that feature-level fusion of the audio and video modalities outperforms decision-level fusion. The proposed CNN model predominantly improves the prediction accuracy for the valence dimension and brings a consistent performance improvement over data recorded from natural interactions. The results demonstrate the effectiveness of generating emotion distributions from a limited number of labels and predicting the distribution using dynamic features and neural networks.
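As a rough illustration of what a two-dimensional discrete distribution over the valence-activation space looks like, the sketch below bins a handful of rater labels into a normalized grid. It is not the label processing method proposed in the paper, whose details the abstract does not give; the function name `rating_distribution`, the 5-point ordinal scale, and the optional additive smoothing are all assumptions made for this example.

```python
import numpy as np

def rating_distribution(ratings, n_levels=5, smoothing=0.0):
    """Turn (valence, activation) ordinal ratings into a 2-D probability grid.

    ratings   -- iterable of (valence, activation) pairs, each in 1..n_levels
    n_levels  -- size of the ordinal scale (assumed, not taken from the paper)
    smoothing -- pseudo-count added to every cell (assumed, optional)
    """
    grid = np.full((n_levels, n_levels), float(smoothing))
    for valence, activation in ratings:
        if not (1 <= valence <= n_levels and 1 <= activation <= n_levels):
            raise ValueError("rating outside the ordinal scale")
        grid[valence - 1, activation - 1] += 1.0
    total = grid.sum()
    if total == 0:
        raise ValueError("no ratings and no smoothing: distribution undefined")
    return grid / total

# Four hypothetical raters: two agree, two disagree.
dist = rating_distribution([(1, 1), (1, 1), (5, 3), (2, 4)])
print(dist[0, 0])   # share of raters choosing valence=1, activation=1
```

Keeping the full grid, rather than the mean rating, preserves the disagreement between raters that averaging would discard.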
Automatically Predicting Human Knowledgeability through Non-verbal Cues
Abdelwahab Bourai, Tadas Baltrušaitis, and Louis-Philippe Morency
(Carnegie Mellon University, USA)
Humans possess an incredible ability to transmit and decode "metainformation" through non-verbal actions in daily communication amongst each other. One communicative phenomenon that is transmitted through these subtle cues is knowledgeability. In this work, we conduct two experiments. First, we analyze which non-verbal features are important for identifying knowledgeable people when responding to a question. Next, we train a model to predict the knowledgeability of speakers in a game show setting. We achieve results that surpass chance and human performance using a multimodal approach fusing prosodic and visual features. We believe computer systems that can incorporate emotional reasoning of this level can greatly improve human-computer communication and interaction.
Pooling Acoustic and Lexical Features for the Prediction of Valence
Zakaria Aldeneh, Soheil Khorram, Dimitrios Dimitriadis, and Emily Mower Provost
(University of Michigan, USA; IBM Research, USA)
In this paper, we present an analysis of different multimodal fusion approaches in the context of deep learning, focusing on pooling intermediate representations learned for the acoustic and lexical modalities. Traditional approaches to multimodal feature pooling include: concatenation, element-wise addition, and element-wise multiplication. We compare these traditional methods to outer-product and compact bilinear pooling approaches, which consider more comprehensive interactions between features from the two modalities. We also study the influence of each modality on the overall performance of a multimodal system. Our experiments on the IEMOCAP dataset suggest that: (1) multimodal methods that combine acoustic and lexical features outperform their unimodal counterparts; (2) the lexical modality is better for predicting valence than the acoustic modality; (3) outer-product-based pooling strategies outperform other pooling strategies.
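The pooling operations named in this abstract can be sketched on plain feature vectors as follows. This is a minimal illustration, not the authors' implementation: the function name `pool` and the toy vectors are assumptions, and compact bilinear pooling (an approximation of the outer product) is left out.

```python
import numpy as np

def pool(acoustic, lexical, method):
    """Combine two 1-D feature vectors with one of four pooling schemes."""
    a = np.asarray(acoustic, dtype=float)
    l = np.asarray(lexical, dtype=float)
    if method == "concat":
        return np.concatenate([a, l])      # length len(a) + len(l)
    if method in ("add", "multiply") and a.shape != l.shape:
        raise ValueError("element-wise pooling needs equal-sized vectors")
    if method == "add":
        return a + l                       # element-wise addition
    if method == "multiply":
        return a * l                       # element-wise multiplication
    if method == "outer":
        # Every acoustic feature interacts with every lexical feature.
        return np.outer(a, l).ravel()      # length len(a) * len(l)
    raise ValueError(f"unknown pooling method: {method}")

acoustic = [1.0, 2.0]   # hypothetical acoustic representation
lexical = [3.0, 4.0]    # hypothetical lexical representation
for name in ("concat", "add", "multiply", "outer"):
    print(name, pool(acoustic, lexical, name))
```

The outer product grows with the product of the two dimensionalities, which is the usual motivation for compact approximations when the representations are large.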

Oral Session 3: Touch and Gesture

Hand-to-Hand: An Intermanual Illusion of Movement
Dario Pittera, Marianna Obrist, and Ali Israr
(Disney Research, USA; University of Sussex, UK)
Apparent tactile motion has been shown to occur across many contiguous parts of the body, such as fingers, forearms, and back. A recent study demonstrated the possibility of eliciting the illusion of movement from one hand to the other when interconnected by a tablet. In this paper we explore intermanual apparent tactile motion without any object between the hands. In a series of psychophysical experiments we determine the control space for generating smooth and consistent motion, using two vibrating handles which we refer to as the Hand-to-Hand vibrotactile device. In a first experiment we investigated the occurrence of the phenomenon (i.e., movement illusion) and the generation of a perceptive model. In a second experiment, based on those results, we investigated the effect of hand postures on the illusion. Finally, in a third experiment we explored two visuo-tactile matching tasks in a multimodal VR setting. Our results can be applied in VR applications with intermanual tactile interactions.
An Investigation of Dynamic Crossmodal Instantiation in TUIs
Feng Feng and Tony Stockman
(Queen Mary University of London, UK)
There is a growing research interest in combining crossmodal research with tangible interaction design. However, most behavioural research on crossmodal perception and on multimodal tangible interaction evaluation is based on static and non-interactive signals. Grounded in cognitive neuroscience studies and behavioural research on multisensory perception, we present two interactive, dynamic crossmodal studies based on unimodal and crossmodal physical dimensions through an interactive table-top. We tested the implementation in two user studies. Results indicate that, first, dynamic crossmodal congruency was spontaneously invoked during the tangible interaction and elicited better user performance than unimodal instantiation in the conditions involving vision and haptics; second, crossmodal instantiations using all three modalities may not produce the best user performance in the interaction; and third, participants' subconscious reactions towards perceived stimuli during interaction may not always be consistent with their explicit verbalization.
“Stop over There”: Natural Gesture and Speech Interaction for Non-critical Spontaneous Intervention in Autonomous Driving
Robert Tscharn, Marc Erich Latoschik, Diana Löffler, and Jörn Hurtienne
(University of Würzburg, Germany)
We propose a new multimodal input technique for Non-critical Spontaneous Situations (NCSSs) in autonomous driving scenarios such as selecting a parking lot or picking up a hitchhiker. Speech and deictic (pointing) gestures were combined to instruct the car about desired interventions which include spatial references to the current environment (e.g., "stop over [pointing] there" or "take [pointing] this parking lot"). In this way, advantages from both modalities were exploited: Speech allows for selecting from many maneuvers and functions in the car (e.g., stop, park), whereas deictic gestures provide a natural and intuitive way of indicating spatial discourse referents used in these interventions (e.g., near this tree, that parking lot). The speech and pointing gesture input was compared to speech and touch-based input in a user study with 38 participants. The touch-based input was selected as a baseline due to its widespread use in in-car touch screens. The evaluation showed that speech and pointing gestures are perceived as more natural, more intuitive and less cognitively demanding than speech and touch, and are thus recommended as an NCSS intervention technique for autonomous driving.
Pre-touch Proxemics: Moving the Design Space of Touch Targets from Still Graphics towards Proxemic Behaviors
Ilhan Aslan and Elisabeth André
(University of Augsburg, Germany)
Proxemic touch targets continuously change in relation to a user's hand in mid-air before a physical touch occurs. Previous work has, for example, shown that expanding targets are capable of improving target acquisition performance on touch interfaces. However, it is unclear how proxemic touch targets influence user experience (UX) in a broader sense, including hedonic qualities. Towards closing this research gap, the paper reports on two user studies. The first study is a qualitative study with five experts, providing in-depth insights on a variety of change-types (e.g., size, form, color) and how they, for example, influence perceived functional and aesthetic qualities of proxemic touch targets. A follow-up user study with 36 participants explores the UX of a proxemic touch target compared to non-proximal versions of the same target. The results highlight a significant positive effect of the proxemic design on both pragmatic and hedonic qualities.
Freehand Grasping in Mixed Reality: Analysing Variation during Transition Phase of Interaction
Maadh Al-Kalbani, Maite Frutos-Pascual, and Ian Williams
(Birmingham City University, UK)
This paper presents an assessment of the variability in freehand grasping of virtual objects in an exocentric mixed reality environment. We report on an experiment covering 480 grasp-based motions in the transition phase of interaction to determine the level of variation in the grasp aperture. Controlled laboratory conditions were used where 30 right-handed participants were instructed to grasp and move an object (cube or sphere) using a single medium wrap grasp from a starting location (A) to a target location (B) in a controlled manner. We present a comprehensive statistical analysis of the results showing the variation in grasp change during this phase of interaction. In conclusion, we detail recommendations for freehand virtual object interaction design, notably that consideration should be given to the change in grasp aperture over the transition phase of interaction.
Rhythmic Micro-Gestures: Discreet Interaction On-the-Go
Euan Freeman, Gareth Griffiths, and Stephen A. Brewster
(University of Glasgow, UK)
We present rhythmic micro-gestures, micro-movements of the hand that are repeated in time with a rhythm. We present a user study that investigated how well users can perform rhythmic micro-gestures and if they can use them eyes-free with non-visual feedback. We found that users could successfully use our interaction technique (97% success rate across all gestures) with short interaction times, rating them as low difficulty as well. Simple audio cues that only convey the rhythm outperformed animations showing the hand movements, supporting rhythmic micro-gestures as an eyes-free input technique.

Oral Session 4: Sound and Interaction

Evaluation of Psychoacoustic Sound Parameters for Sonification
Jamie Ferguson and Stephen A. Brewster
(University of Glasgow, UK)
Sonification designers have little theory or experimental evidence to guide the design of data-to-sound mappings. Many mappings use acoustic representations of data values which do not correspond with the listener's perception of how that data value should sound during sonification. This research evaluates data-to-sound mappings that are based on psychoacoustic sensations, in an attempt to move towards using data-to-sound mappings that are aligned with the listener's perception of the data value's auditory connotations. Multiple psychoacoustic parameters were evaluated over two experiments, which were designed in the context of a domain-specific problem - detecting the level of focus of an astronomical image through auditory display. Recommendations for designing sonification systems with psychoacoustic sound parameters are presented based on our results.
Utilising Natural Cross-Modal Mappings for Visual Control of Feature-Based Sound Synthesis
Augoustinos Tsiros and Grégory Leplâtre
(Edinburgh Napier University, UK)
This paper presents the results of an investigation into audio-visual (AV) correspondences conducted as part of the development of Morpheme, a painting interface to control a corpus-based concatenative sound synthesis algorithm. Previous research has identified strong AV correspondences between dimensions such as pitch and vertical position or loudness and size. However, these correspondences are usually established empirically by only varying a single audio or visual parameter. Although it is recognised that the perception of AV correspondences is affected by the interaction between the parameters of auditory or visual stimuli when these are complex multidimensional objects, there has been little research into perceived AV correspondences when complex dynamic sounds are involved. We conducted an experiment in which two AV mapping strategies and three audio corpora were empirically evaluated. 110 participants were asked to rate the perceived similarity of six AV associations. The results confirmed that size/loudness, vertical position/pitch, colour brightness/spectral brightness are strongly associated. A weaker but significant association was found between texture granularity and sound dissonance, as well as colour complexity and sound dissonance. Harmonicity was found to have a moderating effect on the perceived strengths of these associations: the higher the harmonicity of the sounds, the stronger the perceived AV associations.
Artifacts Available

Oral Session 5: Methodology

Automatic Classification of Auto-correction Errors in Predictive Text Entry Based on EEG and Context Information
Felix Putze, Maik Schünemann, Tanja Schultz, and Wolfgang Stuerzlinger
(University of Bremen, Germany; Simon Fraser University, Canada)

State-of-the-art auto-correction methods for predictive text entry systems work reasonably well, but can never be perfect due to the properties of human language. We present an approach for the automatic detection of erroneous auto-corrections based on brain activity and text-entry-based context features. We describe an experiment and a new system for the classification of human reactions to auto-correction errors. We show how auto-correction errors can be detected with an average accuracy of 85%.

Cumulative Attributes for Pain Intensity Estimation
Joy O. Egede and Michel Valstar
(University of Nottingham at Ningbo, China; University of Nottingham, UK)
Pain estimation from face video is a hard problem in automatic behaviour understanding. One major obstacle is the difficulty of collecting sufficient amounts of data, with balanced amounts of data for all pain intensity levels. To overcome this, we propose to adopt Cumulative Attributes, which assume that attributes for high pain levels with few examples are a superset of all attributes of lower pain levels. Experimental results show a consistent relative performance increase in the order of 20% regardless of features used. Our final system significantly outperforms the state of the art on the UNBC McMaster Shoulder Pain database by using cumulative attributes with Relevance Vector Regression on a combination of features, including appearance, geometric, and deep learned features.
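The superset idea behind cumulative attributes can be illustrated with a simple binary encoding in which level k switches on the first k attributes. This sketch assumes that common encoding; the abstract does not spell out the exact attribute definition or how the attributes are regressed, and the names `encode_cumulative` and `decode_cumulative` and the `max_level` value are illustrative only.

```python
import numpy as np

def encode_cumulative(level, max_level):
    """Encode an ordinal level as a binary vector whose first `level` entries are 1."""
    if not 0 <= level <= max_level:
        raise ValueError("level must lie in 0..max_level")
    return (np.arange(1, max_level + 1) <= level).astype(int)

def decode_cumulative(attributes, threshold=0.5):
    """Recover a level from (possibly noisy) attribute scores by counting active ones."""
    return int(np.sum(np.asarray(attributes, dtype=float) >= threshold))

low = encode_cumulative(2, max_level=5)    # attributes active at a low level
high = encode_cumulative(4, max_level=5)   # attributes active at a higher level
print(low, high)
print(bool(np.all(high >= low)))           # higher level contains the lower one
print(decode_cumulative([0.9, 0.8, 0.6, 0.2, 0.1]))
```

Because every attribute is active for all levels at or above its index, rare high-intensity examples still contribute training signal to the attributes they share with frequent low-intensity ones.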
Towards the Use of Social Interaction Conventions as Prior for Gaze Model Adaptation
Rémy Siegfried, Yu Yu, and Jean-Marc Odobez
(Idiap, Switzerland; EPFL, Switzerland)
Gaze is an important non-verbal cue involved in many facets of social interactions like communication, attentiveness or attitudes. Nevertheless, extracting gaze directions visually and remotely usually suffers from large errors because of low resolution images, inaccurate eye cropping, or large eye shape variations across the population, amongst others. This paper hypothesizes that these challenges can be addressed by exploiting multimodal social cues for gaze model adaptation on top of a head-pose-independent 3D gaze estimation framework. First, a robust eye cropping refinement is achieved by combining a semantic face model with eye landmark detections. We also investigate whether temporal smoothing can overcome the limitations of instantaneous refinement. Secondly, to study whether social interaction conventions could be used as priors for adaptation, we exploited the speaking status and head pose constraints to derive soft gaze labels and infer person-specific gaze bias using robust statistics. Experimental results on gaze coding in natural interactions from two different settings demonstrate that the two steps of our gaze adaptation method contribute to reducing gaze errors by a large margin over the baseline and can be generalized to several identities in challenging scenarios.
Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning
Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency
(Carnegie Mellon University, USA)
With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. In contrast to previous work in multimodal sentiment analysis, which focuses on holistic information in speech segments such as bag-of-words representations and average facial expression intensity, we propose a novel deep architecture that is able to perform modality fusion at the word level. Our Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model is composed of two modules. The Gated Multimodal Embedding alleviates the difficulties of fusion when there are noisy modalities. The LSTM with Temporal Attention performs word-level fusion at a finer resolution between the input modalities and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and perform better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis of our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering these noisy modalities out. These results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.
IntelliPrompter: Speech-Based Dynamic Note Display Interface for Oral Presentations
Reza Asadi, Ha Trinh, Harriet J. Fell, and Timothy W. Bickmore
(Northeastern University, USA)
The fear of forgetting what to say next is a common problem in oral presentation delivery. To maintain good content coverage and reduce anxiety, many presenters use written notes in conjunction with presentation slides. However, excessive use of notes during delivery can lead to disengagement with the audience and low quality presentations. We designed IntelliPrompter, a speech-based note display system that automatically tracks a presenter’s coverage of each slide’s content and dynamically adjusts the note display interface to highlight the most likely next topic to present. We developed two versions of our intelligent teleprompters using Google Glass and computer screen displays. The design of our system was informed by findings from 36 interviews with presenters and analysis of a presentation note corpus. In a within-subjects study comparing our dynamic screen-based and Google Glass note display interfaces with a static note system, presenters and independent judges expressed a strong preference for the dynamic screen-based system.

Oral Session 6: Artificial Agents and Wearable Sensors

Head and Shoulders: Automatic Error Detection in Human-Robot Interaction
Pauline Trung, Manuel Giuliani, Michael Miksch, Gerald Stollnberger, Susanne Stadler, Nicole Mirnig, and Manfred Tscheligi
(University of Salzburg, Austria; University of the West of England, UK; Austrian Institute of Technology, Austria)
We describe a novel method for automatic detection of errors in human-robot interactions. Our approach is to detect errors based on the classification of head and shoulder movements of humans who are interacting with erroneous robots. We conducted a user study in which participants interacted with a robot that we programmed to make two types of errors: social norm violations and technical failures. During the interaction, we recorded the behavior of the participants with a Kinect v1 RGB-D camera. Overall, we recorded a data corpus of 237,998 frames at 25 frames per second; 83.48% of the frames showed no error situation and 16.52% showed an error situation. Furthermore, we computed six different feature sets to represent the movements of the participants and temporal aspects of their movements. Using this data, we trained a rule learner, a Naive Bayes classifier, and a k-nearest neighbor classifier and evaluated the classifiers with 10-fold cross validation and leave-one-out cross validation. The results of this evaluation suggest the following: (1) the detection of an error situation works well when the robot has seen the human before; (2) rule learner and k-nearest neighbor classifiers work well for automated error detection when the robot is interacting with a known human; (3) for unknown humans, the Naive Bayes classifier performed best; (4) the classification of social norm violations performs worst; (5) there was no large performance difference between using the original data and normalized feature sets that represent the relative position of the participants.
The Reliability of Non-verbal Cues for Situated Reference Resolution and Their Interplay with Language: Implications for Human Robot Interaction
Stephanie Gross, Brigitte Krenn, and Matthias Scheutz
(Austrian Research Institute for Artificial Intelligence, Austria; Tufts University, USA)
When uttering referring expressions in situated task descriptions, humans naturally use verbal and non-verbal channels to transmit information to their interlocutor. To develop mechanisms for robot architectures capable of resolving object references in such interaction contexts, we need to better understand the multi-modality of human situated task descriptions. In current computational models, mainly pointing gestures, eye gaze, and objects in the visual field are included as non-verbal cues, if any. We analyse reference resolution to objects in an object manipulation task and find that only up to 50% of all referring expressions to objects can be resolved including language, eye gaze and pointing gestures. Thus, we extract other non-verbal cues necessary for reference resolution to objects, investigate the reliability of the different verbal and non-verbal cues, and formulate lessons for the design of a robot’s natural language understanding capabilities.
Do You Speak to a Human or a Virtual Agent? Automatic Analysis of User’s Social Cues during Mediated Communication
Magalie Ochs, Nathan Libermann, Axel Boidin, and Thierry Chaminade
(Aix-Marseille University, France; University of Toulon, France; Picxel, France)
While several research works have shown that virtual agents are able to generate natural and social behaviors from users, few of them have compared these social reactions to those expressed during a human-human mediated communication. In this paper, we propose to explore the social cues expressed by a user during a mediated communication either with an embodied conversational agent or with another human. For this purpose, we have exploited a machine learning method to identify the facial and head social cues characteristics in each interaction type and to construct a model to automatically determine if the user is interacting with a virtual agent or another human. The results show that, in fact, the users do not express the same facial and head movements during a communication with a virtual agent or another user. Based on these results, we propose to use such a machine learning model to automatically measure the social capability of a virtual agent to generate a social behavior in the user comparable to a human-human interaction. The resulting model can detect automatically if the user is communicating with a virtual or real interlocutor, looking only at the user’s face and head during one second.
Estimating Verbal Expressions of Task and Social Cohesion in Meetings by Quantifying Paralinguistic Mimicry
Marjolein C. Nanninga, Yanxia Zhang, Nale Lehmann-Willenbrock, Zoltán Szlávik, and Hayley Hung
(Delft University of Technology, Netherlands; University of Amsterdam, Netherlands; VU University Amsterdam, Netherlands)
In this paper we propose a novel method of estimating verbal expressions of task and social cohesion by quantifying the dynamic alignment of nonverbal behaviors in speech. As team cohesion has been linked to team effectiveness and productivity, automatically estimating team cohesion can be a useful tool for assessing meeting quality and broader team functioning. In total, more than 20 hours of business meetings (3-8 people) were recorded and annotated for behavioral indicators of group cohesion, distinguishing between social and task cohesion. We hypothesized that behaviors commonly referred to as mimicry can be indicative of verbal expressions of social and task cohesion. Where most prior work targets mimicry of dyads, we investigated the effectiveness of quantifying group-level phenomena. A dynamic approach was adopted in which both the cohesion expressions and the paralinguistic mimicry were quantified on small time windows. By extracting features solely related to the alignment of paralinguistic speech behavior, we found that 2-minute high and low social cohesive regions could be classified with an area under the ROC curve (AUC) of 0.71, performing on par with the state of the art, where turn-taking features were used. Estimating task cohesion was more challenging, obtaining an AUC of 0.64, which outperforms the state of the art. Our results suggest that our proposed methodology is successful in quantifying group-level paralinguistic mimicry. As both the state-of-the-art turn-taking features and mimicry features performed worse on estimating task cohesion, we conclude that social cohesion is more openly expressed by nonverbal vocal behavior than task cohesion.
Data Augmentation of Wearable Sensor Data for Parkinson’s Disease Monitoring using Convolutional Neural Networks
Terry T. Um, Franz M. J. Pfister, Daniel Pichler, Satoshi Endo, Muriel Lang, Sandra Hirche, Urban Fietzek, and Dana Kulić
(University of Waterloo, Canada; LMU Munich, Germany; TU Munich, Germany; Schön Klinik München Schwabing, Germany)

While convolutional neural networks (CNNs) have been successfully applied to many challenging classification applications, they typically require large datasets for training. When the availability of labeled data is limited, data augmentation is a critical preprocessing step for CNNs. However, data augmentation for wearable sensor data has not been deeply investigated yet.

In this paper, various data augmentation methods for wearable sensor data are proposed. The proposed methods and CNNs are applied to the classification of the motor state of Parkinson’s Disease patients, which is challenging due to small dataset size, noisy labels, and large intra-class variability. Appropriate augmentation improves the classification performance from 77.54% to 86.88%.


Poster Session 1

Automatic Assessment of Communication Skill in Non-conventional Interview Settings: A Comparative Study
Pooja Rao S. B, Sowmya Rasipuram, Rahul Das, and Dinesh Babu Jayagopi
(IIIT Bangalore, India)
Effective communication is an important social skill that helps us interpret and connect with the people around us, and it is of utmost importance in employment-based interviews. This paper presents a methodical study and automatic measurement of the communication skill of candidates in different modes of behavioural interviews. It provides a comparative analysis of non-conventional methods of employment interviews, namely 1) interface-based asynchronous video interviews and 2) written interviews (including a short essay). To achieve this, we collected a dataset of 100 structured interviews from participants. These interviews are evaluated independently by two human expert annotators on rubrics specific to each of the settings. We then propose a predictive model using automatically extracted multimodal features (audio, visual, and lexical) and classical machine learning algorithms. Our best model achieves an accuracy of 75% for a binary classification task in all three contexts. We also study the differences between the expert perception and the automatic prediction across the settings.
Low-Intrusive Recognition of Expressive Movement Qualities
Radoslaw Niewiadomski, Maurizio Mancini, Stefano Piana, Paolo Alborno, Gualtiero Volpe, and Antonio Camurri
(University of Genoa, Italy)
In this paper we present a low-intrusive approach to the detection of expressive full-body movement qualities. We focus on two qualities, Lightness and Fragility, and we detect them using the data captured by four wearable devices, two Inertial Movement Units (IMU) and two electromyographs (EMG), placed on the forearms. The work we present in the paper stems from a close collaboration with expressive movement experts (e.g., contemporary dance choreographers) for defining a vocabulary of basic movement qualities. We recorded 13 dancers performing movements expressing the qualities under investigation. The recordings were then segmented, and the perceived level of each quality for each segment was rated by 5 experts using a 5-point Likert scale. We obtained a dataset of 150 segments of movement expressing Fragility and/or Lightness. In the second part of the paper, we define a set of features on IMU and EMG data and extract them from the recorded corpus. Finally, we applied a set of supervised machine learning techniques to classify the segments. The best results for the whole dataset were obtained with a Naive Bayes classifier for Lightness (F-score 0.77), and with a Support Vector Machine classifier for Fragility (F-score 0.77). Our approach can be used in ecological contexts, e.g., during artistic performances.
Digitising a Medical Clerking System with Multimodal Interaction Support
Harrison South, Martin Taylor, Huseyin Dogan, and Nan Jiang
(Bournemouth University, UK; Royal Bournemouth and Christchurch Hospital, UK)
The Royal Bournemouth and Christchurch Hospitals (RBCH) use a series of paper forms to record their interactions with patients; while these have been highly successful, the world is moving to digital and the National Health Service (NHS) has planned to be completely paperless by 2020. Using a project management methodology called Scrum, supported by a usability evaluation technique called the System Usability Scale (SUS) and a workload measurement technique called NASA TLX, a prototype web application system was built and evaluated for the client. The prototype used a varied set of input mediums, including voice, text and stylus, to ensure that users were more likely to adopt the system. This web-based system was successfully developed and evaluated at RBCH. This evaluation showed that the application was usable and accessible but raised many different questions about the nature of software in hospitals. While the project looked at how different input mediums can be used in a hospital, it found that just because it is possible to input data in some familiar format (e.g. voice), doing so is not always in the best interest of the end-users and the patients.
GazeTap: Towards Hands-Free Interaction in the Operating Room
Benjamin Hatscher, Maria Luz, Lennart E. Nacke, Norbert Elkmann, Veit Müller, and Christian Hansen
(University of Magdeburg, Germany; University of Waterloo, Canada; Fraunhofer IFF, Germany)
During minimally invasive interventions, physicians need to interact with medical image data, which cannot be done while the hands are occupied. To address this challenge, we propose two interaction techniques which use gaze and foot as input modalities for hands-free interaction. To investigate the feasibility of these techniques, we created a setup consisting of a mobile eye-tracking device, a tactile floor, two laptops, and the large screen of an angiography suite. We conducted a user study to evaluate how to navigate medical images without the need for hand interaction. Both multimodal approaches, as well as a foot-only interaction technique, were compared regarding task completion time and subjective workload. The results revealed comparable performance of all methods. Selection is accomplished faster via gaze than with a foot-only approach, but gaze and foot easily interfere when used at the same time. This paper contributes to HCI by providing techniques and evaluation results for combined gaze and foot interaction when standing. Our method may enable more effective computer interactions in the operating room, resulting in a more beneficial use of medical information.
Boxer: A Multimodal Collision Technique for Virtual Objects
Byungjoo Lee, Qiao Deng, Eve Hoggan, and Antti Oulasvirta
(Aalto University, Finland; KAIST, South Korea; Aarhus University, Denmark)
Virtual collision techniques are interaction techniques for invoking discrete events in a virtual scene, e.g. throwing, pushing, or pulling an object with a pointer. The conventional approach involves detecting collisions as soon as the pointer makes contact with the object. Furthermore, in general, motor patterns can only be adjusted based on visual feedback. The paper presents a multimodal technique based on the principle that collisions should be aligned with the most salient sensory feedback. Boxer (1) triggers a collision at the moment when the pointer's speed reaches a minimum after first contact and (2) is synchronized with vibrotactile stimuli presented to the hand controlling the pointer. Boxer was compared with the conventional technique in two user studies (with temporal pointing and virtual batting). Boxer improved spatial precision in collisions by 26.7%, while accuracy was compromised under some task conditions. No difference was found in temporal precision. Possibilities for improving virtual collision techniques are discussed.
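The triggering rule in point (1) of this abstract can be sketched as a search over sampled pointer speeds. This is only one possible reading: it assumes "reaches a minimum" means the first local minimum at or after first contact, and the function name and sample format are hypothetical rather than taken from the paper.

```python
def boxer_trigger_index(speeds, contact_index):
    """Return the index of the first local speed minimum after contact.

    speeds: pointer speed samples, one per frame.
    contact_index: frame at which the pointer first touches the object.
    Returns None when no local minimum is found in the remaining samples.
    """
    start = max(contact_index, 1)  # a local minimum needs a left neighbour
    for i in range(start, len(speeds) - 1):
        if speeds[i] <= speeds[i - 1] and speeds[i] < speeds[i + 1]:
            return i
    return None


if __name__ == "__main__":
    samples = [5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0]
    print(boxer_trigger_index(samples, contact_index=2))
```

A conventional technique would fire at `contact_index` itself; the sketch instead defers the event until the pointer has decelerated and is about to speed up again.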
Trust Triggers for Multimodal Command and Control Interfaces
Helen Hastie, Xingkun Liu, and Pedro Patron
(Heriot-Watt University, UK; SeeByte, UK)
For autonomous systems to be accepted by society and operators, they have to instil the appropriate level of trust. In this paper, we discuss what dimensions constitute trust and examine certain triggers of trust for an autonomous underwater vehicle, comparing a multimodal command and control interface with a language-only reporting system. We conclude that there is a relationship between perceived trust and the clarity of a user's Mental Model and that this Mental Model is clearer in a multimodal condition, compared to language-only. Regarding trust triggers, we are able to show that a number of triggers, such as anomalous sensor readings, noticeably modify the perceived trust of the subjects, but in an appropriate manner, thus illustrating the utility of the interface.
TouchScope: A Hybrid Multitouch Oscilloscope Interface
Matthew Heinz, Sven Bertel, and Florian Echtler
(Bauhaus-Universität Weimar, Germany)
We present TouchScope, a hybrid multitouch interface for common off-the-shelf oscilloscopes. Oscilloscopes are a valuable tool for analyzing and debugging electronic circuits, but are also complex scientific instruments. Novices are faced with a seemingly overwhelming array of knobs and buttons, and usually require lengthy training before being able to use these devices productively. In this paper, we present our implementation of TouchScope which uses a multitouch tablet in combination with an unmodified off-the-shelf oscilloscope to provide a novice-friendly hybrid interface, combining both the low entry barrier of a touch-based interface and the high degrees of freedom of a conventional button-based interface. Our evaluation with 29 inexperienced participants shows a comparable performance to traditional learning materials as well as a significantly higher level of perceived usability.
A Multimodal System to Characterise Melancholia: Cascaded Bag of Words Approach
Shalini Bhatia, Munawar Hayat, and Roland Goecke
(University of Canberra, Australia)
Recent years have seen a lot of activity in affective computing for automated analysis of depression. However, no research has so far proposed a multimodal system for classifying different subtypes of depression such as melancholia. The mental state assessment of a mood disorder depends primarily on appearance, behaviour, speech, thought, perception, mood and facial affect. Mood and facial affect mainly contribute to distinguishing melancholia from non-melancholia. These are assessed by clinicians, and hence vulnerable to subjective judgement. As a result, clinical assessment alone may not accurately capture the presence or absence of specific disorders such as melancholia, a distressing condition whose presence has important treatment implications. Melancholia is characterised by severe anhedonia and psychomotor disturbance, which can be a combination of motor retardation with periods of superimposed agitation. Psychomotor disturbance can be sensed in both face and voice. To the best of our knowledge, this study is the first attempt to propose a multimodal system to differentiate melancholia from non-melancholia and healthy controls. We report the sensitivity and specificity of classification in depressive subtypes.
Crowdsourcing Ratings of Caller Engagement in Thin-Slice Videos of Human-Machine Dialog: Benefits and Pitfalls
Vikram Ramanarayanan, Chee Wee Leong, David Suendermann-Oeft, and Keelan Evanini
(ETS at San Francisco, USA; ETS at Princeton, USA)
We analyze the efficacy of different crowds of naive human raters in rating engagement during human-machine dialog interactions. Each rater viewed multiple 10-second, thin-slice videos of native and non-native English speakers interacting with a computer-assisted language learning (CALL) system and rated how engaged or disengaged those callers were while interacting with the automated agent. We observe how the crowd's ratings compare to callers' self-ratings of engagement, and further study how the distribution of these rating assignments varies as a function of whether the automated system or the caller was speaking. Finally, we discuss the potential applications and pitfalls of such crowdsourced paradigms in designing, developing and analyzing engagement-aware dialog systems.
Modelling Fusion of Modalities in Multimodal Interactive Systems with MMMM
Bruno Dumas, Jonathan Pirau, and Denis Lalanne
(University of Namur, Belgium; University of Fribourg, Switzerland)
Several models and design spaces have been defined and are regularly used to describe how modalities can be fused together in an interactive multimodal system. However, models such as CASE, the CARE properties and TYCOON were all defined more than two decades ago. In this paper, we start with a critical review of these models, which notably highlights a confusion between how the user side and the system side of a multimodal system are described. Based on this critical review, we define MMMM v1, an improved model for the description of multimodal fusion in interactive systems that targets completeness. A first user evaluation comparing the models revealed that MMMM v1 was indeed complete, but at the cost of user friendliness. Based on the results of this first evaluation, an improved version, called MMMM v2, was defined. A second user evaluation highlighted that this model achieved a good balance between complexity, consistency and completeness compared to the state of the art.
Temporal Alignment using the Incremental Unit Framework
Casey Kennington, Ting Han, and David Schlangen
(Boise State University, USA; Bielefeld University, Germany)
We propose a method for temporal alignment (a precondition of meaningful fusion) in multimodal systems, using the incremental unit dialogue system framework, which gives the system flexibility in how it handles alignment: either by delaying a modality for a specified amount of time, or by revoking (i.e., backtracking) processed information so that multiple information sources can be processed jointly. We evaluate our approach in an offline experiment with multimodal data and find that the incremental framework is flexible and shows promise as a solution to the problem of temporal alignment in multimodal systems.
Multimodal Gender Detection
Mohamed Abouelenien, Verónica Pérez-Rosas, Rada Mihalcea, and Mihai Burzo
(University of Michigan, USA)
Automatic gender classification is receiving increasing attention in the computer interaction community as the need for personalized, reliable, and ethical systems arises. To date, most gender classification systems have been evaluated on textual and audiovisual sources. This work explores the possibility of enhancing such systems with physiological cues obtained from thermography and physiological sensor readings. Using a multimodal dataset consisting of audiovisual, thermal, and physiological recordings of males and females, we extract features from five different modalities, namely acoustic, linguistic, visual, thermal, and physiological. We then conduct a set of experiments where we explore the gender prediction task using single and combined modalities. Experimental results suggest that physiological and thermal information can be used to recognize gender at reasonable accuracy levels, which are comparable to the accuracy of current gender prediction systems. Furthermore, we show that the use of non-contact physiological measurements, such as thermography readings, can enhance current systems that are based on audio or visual input. This can be particularly useful for scenarios where non-contact approaches are preferred, i.e., when data is captured under noisy audiovisual conditions or when video or speech data are not available due to ethical considerations.
How May I Help You? Behavior and Impressions in Hospitality Service Encounters
Skanda Muralidhar, Marianne Schmid Mast, and Daniel Gatica-Perez
(Idiap, Switzerland; EPFL, Switzerland; University of Lausanne, Switzerland)

In the service industry, customers often assess quality of service based on the behavior, perceived personality, and other attributes of the front-line service employees they interact with. Interpersonal communication during these interactions is key to determining customer satisfaction and perceived service quality. We present a computational framework to automatically infer perceived performance and skill variables of employees interacting with customers in a hotel reception desk setting using nonverbal behavior, studying a dataset of 169 dyadic interactions involving students from a hospitality management school. We also study the connections between impressions of Big-5 personality traits, attractiveness, and performance of receptionists. In regression tasks, our automatic framework achieves R² = 0.30 for performance impressions using audio-visual nonverbal cues, compared to 0.35 using personality impressions, while attractiveness impressions had low predictive power. We also study the integration of nonverbal behavior and Big-5 personality impressions towards increasing regression performance (R² = 0.37).

Tracking Liking State in Brain Activity while Watching Multiple Movies
Naoto Terasawa, Hiroki Tanaka, Sakriani Sakti, and Satoshi Nakamura
(NAIST, Japan)
Emotion is valuable information in various applications ranging from human-computer interaction to automated multimedia content delivery. Conventional methods to recognize emotion were based on speech prosody cues, facial expression, and body language. However, this information may not appear when people watch a movie. In recent years, some studies have started to use electroencephalogram (EEG) signals to recognize emotion. However, these studies analyzed the EEG data of each movie scene as a whole for emotion classification, so detailed information about changes in emotional state cannot be extracted. In this study, we use EEG to track affective state while watching multiple movies. Experiments were done by measuring continuous liking state while subjects watched three types of movies, and then constructing a subject-dependent emotional state tracking model. We used a support vector machine (SVM) as a classifier, and support vector regression (SVR) for regression. The best classification accuracy was 77.6%, and the best regression model achieved a correlation coefficient of 0.645 between actual and predicted liking state. These results demonstrate that continuous emotional state can be predicted by our EEG-based method.
Does Serial Memory of Locations Benefit from Spatially Congruent Audiovisual Stimuli? Investigating the Effect of Adding Spatial Sound to Visuospatial Sequences
Benjamin Stahl and Georgios Marentakis
(Graz University of Technology, Austria)
We present a study that investigates whether multimodal audiovisual presentation improves the recall of visuospatial sequences. In a modified Corsi block-tapping experiment, participants were asked to serially recall a spatial stimulus sequence in conditions that manipulated stimulus modality and sequence length. Adding spatial sound to the visuospatial sequences did not improve serial spatial recall performance. The results support the hypothesis that no modality-specific components in spatial working memory exist.
ZSGL: Zero Shot Gestural Learning
Naveen Madapana and Juan Wachs
(Purdue University, USA)
Gesture recognition systems enable humans to interact with machines in an intuitive and natural way. Humans tend to create gestures on the fly, and conventional systems lack the adaptability to learn new gestures beyond the training stage. This problem can be best addressed using Zero Shot Learning (ZSL), a paradigm in machine learning that aims to recognize unseen objects by just having a description of them. ZSL for gestures has hardly been addressed in computer vision research due to the inherent ambiguity and the contextual dependency associated with gestures. This work proposes an approach for Zero Shot Gestural Learning (ZSGL) by leveraging the semantic information that is embedded in the gestures. First, a human-factors-based approach has been followed to generate semantic descriptors for gestures that can generalize to the existing gesture classes. Second, we assess the performance of various existing state-of-the-art algorithms on ZSL for gestures using two standard datasets: MSRC-12 and CGD2011. The obtained results (26.35% unseen class accuracy) parallel the benchmark accuracies of attribute-based object recognition and justify our claim that ZSL is a desirable paradigm for gesture-based systems.
Markov Reward Models for Analyzing Group Interaction
Gabriel Murray
(University of the Fraser Valley, Canada)
In this work we introduce a novel application of Markov Reward models for studying group interaction. We describe a sample state representation for social sequences in meetings, and give examples of how particular states can be associated with immediate positive or negative rewards, based on outcomes of interest. We then present a Value Iteration algorithm for estimating the values of states. While we focus on two specific applications of Markov Reward models to small group interaction in meetings, there are many ways in which such a model can be used to study different facets of group dynamics and interaction. To encourage such research, we are making the Value Iteration software freely available.
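To make the Value Iteration step mentioned in this abstract concrete, here is a minimal sketch for a Markov Reward model, repeatedly applying V(s) = R(s) + gamma * sum over s' of P(s, s') * V(s'). It is not the author's released software; the state names, reward values, discount factor, and dictionary layout are all assumptions chosen for illustration.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8, max_iter=10000):
    """Estimate state values of a Markov Reward model.

    transitions: {state: {next_state: probability}}; every next_state
                 must also appear as a key of transitions.
    rewards:     {state: immediate reward}; missing states default to 0.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iter):
        new_values = {
            s: rewards.get(s, 0.0)
            + gamma * sum(p * values[t] for t, p in transitions[s].items())
            for s in transitions
        }
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:  # stop once the largest update is negligible
            break
    return values


if __name__ == "__main__":
    # Hypothetical two-state meeting model: "discuss" leads to "decide",
    # and "decide" is absorbing with a positive immediate reward.
    P = {"discuss": {"decide": 1.0}, "decide": {"decide": 1.0}}
    R = {"decide": 1.0}
    print(value_iteration(P, R, gamma=0.5))
```

States with high estimated value are those that tend to lead, directly or eventually, to the rewarded outcomes of interest.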
Publisher's Version Article Search Info
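The Value Iteration step mentioned in the Markov Reward Models abstract above can be illustrated with a minimal sketch. It is not the author's released software: the three meeting states, the transition matrix `P`, the reward vector `r`, and the discount factor `gamma` are all hypothetical values chosen for illustration.

```python
import numpy as np


def value_iteration(P, r, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Estimate state values of a Markov reward model.

    P[i, j] is the probability of moving from state i to state j,
    r[i] is the immediate reward associated with state i.
    """
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    v = np.zeros_like(r)
    for _ in range(max_iter):
        # Bellman backup applied to every state at once.
        v_new = r + gamma * (P @ v)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v


# Hypothetical states: 0 = single speaker, 1 = overlapping talk, 2 = decision reached.
P = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.4, 0.1],
              [0.6, 0.1, 0.3]])
r = np.array([0.0, -1.0, 1.0])  # assumed immediate rewards
values = value_iteration(P, r, gamma=0.9)
```

Because there are no actions in a Markov reward model, the update is linear, so the result can be checked against the closed-form solution of `(I - gamma * P) v = r`.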
Analyzing First Impressions of Warmth and Competence from Observable Nonverbal Cues in Expert-Novice Interactions
Beatrice Biancardi, Angelo Cafaro, and Catherine Pelachaud
(CNRS, France; UPMC, France)
In this paper we present an analysis of a corpus of dyadic expert-novice knowledge sharing interactions. The analysis aims at investigating the relationship between observed non-verbal cues and the formation of first impressions of warmth and competence. We first obtained both discrete and continuous annotations of our data. Discrete descriptors include non-verbal cues such as type of gestures, arm rest poses, head movements and smiles. Continuous descriptors concern annotators' judgments of the expert's warmth and competence during the observed interaction with the novice. Then we computed Odds Ratios between those descriptors. Results highlight the role of smiling in warmth and competence impressions. Smiling is associated with increased levels of warmth and decreased levels of competence. It also affects the impact of other non-verbal cues (e.g. self-adaptor gestures) on warmth and competence. Moreover, our findings provide interesting insights about the role of rest poses, which are associated with decreased levels of warmth and competence impressions.
Publisher's Version Article Search
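For readers unfamiliar with the Odds Ratios used in the warmth and competence analysis above, the following is a small sketch of the standard 2x2 computation. The counts are invented, and the abstract does not say how the continuous judgments were binarized, so the "high/low warmth" split here is an assumption.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 contingency table.

    a: cue present, outcome present    b: cue present, outcome absent
    c: cue absent,  outcome present    d: cue absent,  outcome absent
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical counts: smiling vs. not smiling, high vs. low rated warmth.
smile_warmth_or = odds_ratio(a=30, b=10, c=20, d=40)
```

A value above 1 indicates that the outcome is more likely when the cue is present, and a value below 1 indicates the opposite.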
The NoXi Database: Multimodal Recordings of Mediated Novice-Expert Interactions
Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth André, and Michel Valstar
(CNRS, France; UPMC, France; University of Augsburg, Germany; University of Nottingham, UK)
We present a novel multi-lingual database of natural dyadic novice-expert interactions, named NoXi, featuring screen-mediated dyadic human interactions in the context of information exchange and retrieval. NoXi is designed to provide spontaneous interactions with emphasis on adaptive behaviors and unexpected situations (e.g. conversational interruptions). A rich set of audio-visual data, as well as continuous and discrete annotations, is publicly available through a web interface. Descriptors include low level social signals (e.g. gestures, smiles), functional descriptors (e.g. turn-taking, dialogue acts) and interaction descriptors (e.g. engagement, interest, and fluidity).
Publisher's Version Article Search Info
Head-Mounted Displays as Opera Glasses: Using Mixed-Reality to Deliver an Egalitarian User Experience during Live Events
Carl Bishop, Augusto Esteves, and Iain McGregor
(Edinburgh Napier University, UK)
This paper explores the use of head-mounted displays (HMDs) as a way to deliver a front row experience to any audience member during a live event. To do so, it presents a two-part user study that compares participants' reported sense of presence across three experimental conditions: front row, back row, and back row with HMD (displaying 360° video captured live from the front row). Data was collected using the Temple Presence Inventory (TPI), which measures presence across eight factors. The reported sense of presence in the HMD condition was significantly higher in five of these measures: spatial presence, social presence, passive social presence, active social presence, and social richness. We argue that the non-significant differences found in the other three factors (engagement, social realism, and perceptual realism) are artefacts of participants’ personal taste for the song being performed, or the effects of using a mixed-reality approach. Finally, the paper describes a basic system for low-latency, 360° video live streaming using off-the-shelf, affordable equipment and software.
Publisher's Version Article Search

Poster Session 2

Analyzing Gaze Behavior during Turn-Taking for Estimating Empathy Skill Level
Ryo Ishii, Shiro Kumano, and Kazuhiro Otsuka
(NTT, Japan)
Techniques that use nonverbal behaviors to estimate communication skill in discussions have been receiving a lot of attention in recent research. In this study, we explored the gaze behavior towards the end of an utterance during turn-keeping/changing to estimate empathy skills in multiparty discussions. First, we collected data on Davis' Interpersonal Reactivity Index (IRI) (which measures empathy skill), utterances, and gaze behavior from participants in four-person discussions. The results of the analysis showed that the gaze behavior during turn-keeping/changing differs in accordance with people's empathy skill levels. The most noteworthy result is that a person's empathy skill level is inversely proportional to the frequency of eye contact with the conversational partner during turn-keeping/changing. Specifically, if the current speaker has a high skill level, she often does not look at the listener during turn-keeping and turn-changing. Moreover, when a person with a high skill level is the next speaker, she does not look at the speaker during turn-changing. In contrast, people who have a low skill level often continue to make eye contact with speakers and listeners. On the basis of these findings, we constructed and evaluated four models for estimating empathy skill levels. The evaluation results showed that the average absolute error of estimation is only 0.22 for the gaze transition pattern (GTP) model. This model uses the occurrence probability of GTPs when the person is a speaker or listener during turn-keeping and a speaker, next speaker, or listener during turn-changing. It outperformed the models that used the amount of utterances and duration of gazes. This suggests that the GTP during turn-keeping and turn-changing is effective for estimating an individual's empathy skills in multi-party discussions.
Publisher's Version Article Search
Text Based User Comments as a Signal for Automatic Language Identification of Online Videos
A. Seza Doğruöz, Natalia Ponomareva, Sertan Girgin, Reshu Jain, and Christoph Oehler
(Xoogler, Turkey; Google, USA; Google, France; Google, Switzerland)
Identifying the audio language of online videos is crucial for industrial multi-media applications. Automatic speech recognition systems can potentially detect the language of the audio. However, such systems are not available for all languages. Moreover, background noise, music and multi-party conversations make audio language identification hard. Instead, we utilize text based user comments as a new signal to identify audio language of YouTube videos. First, we detect the language of the text based comments. Augmenting this information with video meta-data features, we predict the language of the videos with an accuracy of 97% on a set of publicly available videos. The subject matter discussed in this research is patent pending.
Publisher's Version Article Search
Gender and Emotion Recognition with Implicit User Signals
Maneesh Bilalpur, Seyed Mostafa Kia, Manisha Chawla, Tat-Seng Chua, and Ramanathan Subramanian
(IIIT Hyderabad, India; Radboud University, Netherlands; IIT Gandhinagar, India; National University of Singapore, Singapore; University of Glasgow, UK; Advanced Digital Sciences Center, Singapore)

We examine the utility of implicit user behavioral signals captured using low-cost, off-the-shelf devices for anonymous gender and emotion recognition. A user study designed to examine male and female sensitivity to facial emotions confirms that females recognize (especially negative) emotions quicker and more accurately than men, mirroring prior findings. Implicit viewer responses in the form of EEG brain signals and eye movements are then examined for existence of (a) emotion and gender-specific patterns from event-related potentials (ERPs) and fixation distributions and (b) emotion and gender discriminability. Experiments reveal that (i) gender and emotion-specific differences are observable from ERPs, (ii) multiple similarities exist between explicit responses gathered from users and their implicit behavioral signals, and (iii) significantly above-chance (≈70%) gender recognition is achievable on comparing emotion-specific EEG responses; gender differences are encoded best for anger and disgust. Also, fairly modest valence (positive vs negative emotion) recognition is achieved with EEG and eye-based features.

Publisher's Version Article Search
Animating the Adelino Robot with ERIK: The Expressive Robotics Inverse Kinematics
Tiago Ribeiro and Ana Paiva
(INESC-ID, Portugal; University of Lisbon, Portugal)
This paper presents ERIK, an inverse kinematics technique for robot animation that is able to control a real expressive manipulator-like robot in real-time, by simultaneously solving for both the robot's full body expressive posture, and orientation of the end-effector (head). Our solution, meant for generic autonomous social robots, was designed to work out of the box on any kinematic chain, and achieved by blending forward kinematics (for posture control) and inverse kinematics (for the gaze-tracking/orientation constraint). Results from a user study show that the resulting expressive motion is still able to convey an expressive intention to the users.
Publisher's Version Article Search Video
Automatic Detection of Pain from Spontaneous Facial Expressions
Fatma Meawad, Su-Yin Yang, and Fong Ling Loy
(University of Glasgow, UK; Tan Tock Seng Hospital, Singapore)
This paper presents a new approach for detecting pain in sequences of spontaneous facial expressions. The motivation for this work is to accompany mobile-based self-management of chronic pain as a virtual sensor for tracking patients' expressions in real-world settings. Operating under such constraints requires a resource efficient approach for processing non-posed facial expressions from unprocessed temporal data. In this work, the facial action units of pain are modeled as sets of distances among related facial landmarks. Using standardized measurements of pain versus no-pain that are specific to each user, changes in the extracted features in relation to pain are detected. The activated features in each frame are combined using an adapted form of the Prkachin and Solomon Pain Intensity scale (PSPI) to detect the presence of pain per frame. Painful features must be activated in N consecutive frames (time window) to indicate the presence of pain in a session. The discussed method was tested on 171 video sessions for 19 subjects from the McMaster painful dataset for spontaneous facial expressions. The results show higher precision than coverage in detecting sequences of pain. Our algorithm achieves 94% precision (F-score=0.82) against human observed labels, 74% precision (F-score=0.62) against automatically generated pain intensities and 100% precision (F-score=0.67) against self-reported pain intensities.
Publisher's Version Article Search
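The two decision steps described in the pain detection abstract above (a per-frame PSPI-style score, then a requirement of N consecutive painful frames) can be sketched as follows. The `pspi` function is the standard Prkachin and Solomon formula over action unit (AU) intensities, not the authors' adapted form, and the threshold and window length in the example are assumed values.

```python
def pspi(au4, au6, au7, au9, au10, au43):
    """Standard Prkachin and Solomon Pain Intensity from AU intensities."""
    return au4 + max(au6, au7) + max(au9, au10) + au43


def has_pain_episode(frame_flags, n):
    """Return True if at least n consecutive frames are flagged as painful."""
    if n < 1:
        raise ValueError("n must be at least 1")
    run = 0
    for flag in frame_flags:
        run = run + 1 if flag else 0  # any non-painful frame resets the run
        if run >= n:
            return True
    return False


# Hypothetical per-frame scores and an assumed per-frame threshold of 1.
scores = [0, 2, 3, 0, 1, 4, 2]
flags = [s >= 1 for s in scores]
session_has_pain = has_pain_episode(flags, n=3)
```

Requiring a run of frames rather than a single frame trades coverage for precision, which is consistent with the precision-over-coverage pattern the abstract reports.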
Evaluating Content-Centric vs. User-Centric Ad Affect Recognition
Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli, and Ramanathan Subramanian
(IIIT Hyderabad, India; Indian Institute of Science, India; Delft University of Technology, Netherlands; National University of Singapore, Singapore; University of Glasgow at Singapore, Singapore)

Despite the fact that advertisements (ads) often include strongly emotional content, very little work has been devoted to affect recognition (AR) from ads. This work explicitly compares content-centric and user-centric ad AR methodologies, and evaluates the impact of enhanced AR on computational advertising via a user study. Specifically, we (1) compile an affective ad dataset capable of evoking coherent emotions across users; (2) explore the efficacy of content-centric convolutional neural network (CNN) features for encoding emotions, and show that CNN features outperform low-level emotion descriptors; (3) examine user-centered ad AR by analyzing Electroencephalogram (EEG) responses acquired from eleven viewers, and find that EEG signals encode emotional information better than content descriptors; (4) investigate the relationship between objective AR and subjective viewer experience while watching an ad-embedded online video stream based on a study involving 12 users. To our knowledge, this is the first work to (a) expressly compare user vs content-centered AR for ads, and (b) study the relationship between modeling of ad emotions and its impact on a real-life advertising application.

Publisher's Version Article Search
A Domain Adaptation Approach to Improve Speaker Turn Embedding using Face Representation
Nam Le and Jean-Marc Odobez
(Idiap, Switzerland)
This paper proposes a novel approach to improve speaker modeling using knowledge transferred from face representation. In particular, we are interested in learning a discriminative metric which allows speaker turns to be compared directly, which is beneficial for tasks such as diarization and dialogue analysis. Our method improves the embedding space of speaker turns by applying maximum mean discrepancy loss to minimize the disparity between the distributions of facial and acoustic embedded features. This approach aims to discover the shared underlying structure of the two embedded spaces, thus enabling the transfer of knowledge from the richer face representation to the counterpart in speech. Experiments are conducted on broadcast TV news datasets, REPERE and ETAPE, to demonstrate the validity of our method. Quantitative results in verification and clustering tasks show promising improvement, especially in cases where speaker turns are short or the training data size is limited.
Publisher's Version Article Search
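As an illustration of the maximum mean discrepancy (MMD) criterion named in the speaker turn embedding abstract above, here is a minimal NumPy sketch of the biased squared-MMD estimate between two sets of embeddings. The RBF kernel, the bandwidth `sigma`, and the random stand-in data are assumptions; the abstract does not specify the kernel or how the loss is combined with the metric-learning objective.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))


def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())


# Stand-ins for face and speech embeddings (random data, for illustration only).
rng = np.random.default_rng(0)
face_emb = rng.normal(size=(50, 4))
speech_emb = face_emb + 3.0  # a deliberately shifted distribution
gap = mmd2(face_emb, speech_emb)
```

Minimizing such a quantity with respect to the speech embedding parameters pulls the two embedding distributions together; in practice it would be written in an automatic differentiation framework rather than plain NumPy.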
Computer Vision Based Fall Detection by a Convolutional Neural Network
Miao Yu, Liyun Gong, and Stefanos Kollias
(University of Lincoln, UK)
In this work, we propose a novel computer vision based fall detection system, which could be applied to the health care of the elderly. For a recorded video stream, background subtraction is first applied to extract the human body silhouette. Extracted silhouettes corresponding to daily activities are used to train a convolutional neural network, which classifies human postures (e.g., bend, stand, lie and sit) and detects a fall event (i.e., a lying posture detected in the floor region). As far as we know, this work is the first attempt to apply a convolutional neural network to fall detection. On a dataset of daily activities recorded from multiple people, we show that the proposed method both achieves better posture classification results than state-of-the-art classifiers and successfully detects fall events with a low false alarm rate.
Publisher's Version Article Search
Predicting Meeting Extracts in Group Discussions using Multimodal Convolutional Neural Networks
Fumio Nihei, Yukiko I. Nakano, and Yutaka Takase
(Seikei University, Japan)
This study proposes the use of multimodal fusion models employing Convolutional Neural Networks (CNNs) to extract meeting minutes from a group discussion corpus. First, unimodal models are created using raw behavioral data such as speech, head motion, and face tracking. These models are then integrated into a fusion model that works as a classifier. The main advantage of this work is that the proposed models were trained without any hand-crafted features, and they outperformed a baseline model that was trained using hand-crafted features. It was also found that multimodal fusion is useful in applying the CNN approach to model multimodal multiparty interaction.
Publisher's Version Article Search
The Relationship between Task-Induced Stress, Vocal Changes, and Physiological State during a Dyadic Team Task
Catherine Neubauer, Mathieu Chollet, Sharon Mozgai, Mark Dennison, Peter Khooshabeh, and Stefan Scherer
(Army Research Lab at Playa Vista, USA; University of Southern California, USA)
It is commonly known that a relationship exists between the human voice and various emotional states. Past studies have demonstrated changes in a number of vocal features, such as fundamental frequency f0 and peakSlope, as a result of varying emotional state. These voice characteristics have been shown to relate to emotional load, vocal tension, and, in particular, stress. Although much research exists in the domain of voice analysis, few studies have assessed the relationship between stress and changes in the voice during a dyadic team interaction. The aim of the present study was to investigate the multimodal interplay between speech and physiology during a high-workload, high-stress team task. Specifically, we studied task-induced effects on participants' vocal signals (the f0 and peakSlope features) and on participants' physiology, through cardiovascular measures. Further, we assessed the relationship between physiological states related to stress and changes in the speaker's voice. We recruited participants with the specific goal of working together to defuse a simulated bomb. Half of our sample participated in an "Ice Breaker" scenario, during which they were allowed to converse and familiarize themselves with their teammate prior to the task, while the other half of the sample served as our "Control". Fundamental frequency (f0), peakSlope, physiological state, and subjective stress were measured during the task. Results indicated that f0 and peakSlope significantly increased from the beginning to the end of each task trial, and were highest in the last trial, which indicates an increase in emotional load and vocal tension. Finally, cardiovascular measures of stress indicated that the vocal and emotional load of speakers towards the end of the task mirrored a physiological state of psychological "threat".
Publisher's Version Article Search
Meyendtris: A Hands-Free, Multimodal Tetris Clone using Eye Tracking and Passive BCI for Intuitive Neuroadaptive Gaming
Laurens R. Krol, Sarah-Christin Freytag, and Thorsten O. Zander
(TU Berlin, Germany)
This paper introduces a completely hands-free version of Tetris that uses eye tracking and passive brain-computer interfacing (a real-time measurement and interpretation of brain activity) to replace existing game elements, as well as introduce novel ones. In Meyendtris, dwell time-based eye tracking replaces the game's direct control elements, i.e. the movement of the tetromino. In addition to that, two mental states of the player influence the game in real time by means of passive brain-computer interfacing. First, a measure of the player's relaxation is used to modulate the speed of the game (and the corresponding music). Second, when upon landing of a tetromino a state of error perception is detected in the player's brain, this last landed tetromino is destroyed. Together, this results in a multimodal, hands-free version of the classic Tetris game that is no longer hindered by manual input bottlenecks, while engaging novel mental abilities of the player.
Publisher's Version Article Search
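The dwell time-based control mentioned in the Meyendtris abstract above follows a common eye-tracking pattern: a target is triggered once gaze has rested on it for a minimum duration. The sketch below shows that general pattern only; the sample format, target names, and the one-second threshold are hypothetical and are not taken from the game.

```python
def dwell_selections(samples, dwell_s=1.0):
    """Return (time, target) pairs at which a dwell selection fires.

    samples: iterable of (timestamp_in_seconds, target_or_None), in time order.
    A target fires once per uninterrupted fixation of at least dwell_s seconds.
    """
    selections = []
    current, start, fired = None, None, False
    for t, target in samples:
        if target != current:
            # Gaze moved to a different target (or away): restart the timer.
            current, start, fired = target, t, False
        if current is not None and not fired and t - start >= dwell_s:
            selections.append((t, current))
            fired = True
    return selections


# Hypothetical gaze stream over two on-screen targets.
gaze = [(0.0, "left"), (0.5, "left"), (1.0, "left"), (1.2, None),
        (1.4, "right"), (1.9, "right"), (2.5, "right"), (3.0, "right")]
events = dwell_selections(gaze, dwell_s=1.0)
```

The `fired` flag prevents a single long fixation from triggering the same target repeatedly, a design choice that a real game might replace with auto-repeat.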
AMHUSE: A Multimodal dataset for HUmour SEnsing
Giuseppe Boccignone, Donatello Conte, Vittorio Cuculo, and Raffaella Lanzarotti
(University of Milan, Italy; University of Tours, France)

We present AMHUSE (A Multimodal dataset for HUmour SEnsing) along with a novel web-based annotation tool named DANTE (Dimensional ANnotation Tool for Emotions). The dataset is the result of an experiment concerning amusement elicitation, involving 36 subjects in order to record the reactions in the presence of 3 amusing and 1 neutral video stimuli. Gathered data include RGB video and depth sequences along with physiological responses (electrodermal activity, blood volume pulse, temperature). The videos were later annotated by 4 experts in terms of continuous valence and arousal dimensions. Both the dataset and the annotation tool are made publicly available for research purposes.

Publisher's Version Article Search
GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication
Mohamed Khamis, Mariam Hassib, Emanuel von Zezschwitz, Andreas Bulling, and Florian Alt
(LMU Munich, Germany; Max Planck Institute for Informatics, Germany)
Although mobile devices provide access to a plethora of sensitive data, most users still only protect them with PINs or patterns, which are vulnerable to side-channel attacks (e.g., shoulder surfing). However, prior research has shown that privacy-aware users are willing to take further steps to protect their private data. We propose GazeTouchPIN, a novel secure authentication scheme for mobile devices that combines gaze and touch input. Our multimodal approach complicates shoulder-surfing attacks by requiring attackers to observe the screen as well as the user’s eyes to find the password. We evaluate the security and usability of GazeTouchPIN in two user studies (N=30). We found that while GazeTouchPIN requires longer entry times, privacy-aware users would use it on-demand when feeling observed or when accessing sensitive data. The results show that the successful shoulder surfing attack rate drops from 68% to 10.4% when using GazeTouchPIN.
Publisher's Version Article Search Video
Multi-task Learning of Social Psychology Assessments and Nonverbal Features for Automatic Leadership Identification
Cigdem Beyan, Francesca Capozzi, Cristina Becchio, and Vittorio Murino
(IIT Genoa, Italy; McGill University, Canada; University of Turin, Italy; University of Verona, Italy)
In social psychology, the leadership investigation is performed using questionnaires which are either i) self-administered or ii) applied to group participants to evaluate other members or iii) filled by external observers. While each of these sources is informative, using them individually might not be as effective as using them jointly. This paper is the first attempt which addresses the automatic identification of leaders in small-group meetings, by learning effective models using nonverbal audio-visual features and the results of social psychology questionnaires that reflect assessments regarding leadership. Learning is based on Multi-Task Learning which is performed without using ground-truth data (GT), but using the results of questionnaires (having substantial agreement with GT), administered to external observers and the participants of the meetings, as tasks. The results show that joint learning results in better performance as compared to single task learning and other baselines.
Publisher's Version Article Search
Multimodal Analysis of Vocal Collaborative Search: A Public Corpus and Results
Daniel McDuff, Paul Thomas, Mary Czerwinski, and Nick Craswell
(Microsoft Research, USA; Microsoft Research, Australia)
Intelligent agents have the potential to help with many tasks. Information seeking and voice-enabled search assistants are becoming very common. However, there remain questions as to the extent to which these agents should sense and respond to emotional signals. We designed a set of information seeking tasks and recruited participants to complete them using a human intermediary. In total we collected data from 22 pairs of individuals, each completing five search tasks. The participants could communicate only using voice, over a VoIP service. Using automated methods we extracted facial action, voice prosody and linguistic features from the audio-visual recordings. We analyzed the characteristics of these interactions that correlated with successful communication and understanding between the pairs. We found that those who were expressive in channels that were missing from the communication channel (e.g., facial actions and gaze) were rated as communicating poorly, being less helpful and understanding. Having a way of reinstating nonverbal cues into these interactions would improve the experience, even when the tasks are purely information seeking exercises. The dataset used for this analysis contains over 15 hours of video, audio and transcripts and reported ratings. It is publicly available for researchers at: http://aka.ms/MISCv1.
Publisher's Version Article Search
UE-HRI: A New Dataset for the Study of User Engagement in Spontaneous Human-Robot Interactions
Atef Ben-Youssef, Chloé Clavel, Slim Essid, Miriam Bilac, Marine Chamoux, and Angelica Lim
(Telecom ParisTech, France; University of Paris-Saclay, France; SoftBank Robotics, France)
In this paper, we present a new dataset of spontaneous interactions between a robot and humans, of which 54 interactions (between 4 and 15-minute duration each) are freely available for download and use. Participants were recorded while holding spontaneous conversations with the robot Pepper. The conversations started automatically when the robot detected the presence of a participant and kept the recording if he/she accepted the agreement (i.e. to be recorded). Pepper was in a public space where the participants were free to start and end the interaction when they wished. The dataset provides rich streams of data that could be used by research and development groups in a variety of areas.
Publisher's Version Article Search Info
Mining a Multimodal Corpus of Doctor’s Training for Virtual Patient’s Feedbacks
Chris Porhet, Magalie Ochs, Jorane Saubesty, Grégoire de Montcheuil, and Roxane Bertrand
(Aix-Marseille University, France; CNRS, France; ENSAM, France; University of Toulon, France)
Doctors should be trained not only to perform medical or surgical acts but also to develop competences in communication for their interaction with patients. For instance, the way doctors deliver bad news has a significant impact on the therapeutic process. In order to facilitate the doctors’ training to break bad news, we aim at developing a virtual patient able to interact in a multimodal way with doctors announcing an undesirable event. One of the key elements in creating an engaging interaction is the feedback behavior of the virtual character. In order to model the virtual patient’s feedbacks in the context of breaking bad news, we have analyzed a corpus of real doctors’ training. The verbal and nonverbal signals of both the doctors and the patients have been annotated. In order to identify the types of feedbacks and the elements that may elicit a feedback, we have explored the corpus based on sequence mining methods. Rules extracted from the corpus enable us to determine when a virtual patient should express which feedbacks when a doctor announces bad news.
Publisher's Version Article Search Info
Multimodal Affect Recognition in an Interactive Gaming Environment using Eye Tracking and Speech Signals
Ashwaq Alhargan, Neil Cooke, and Tareq Binjammaz
(University of Birmingham, UK; De Montfort University, UK)
This paper presents a multimodal affect recognition system for interactive virtual gaming environments using eye tracking and speech signals, captured in gameplay scenarios that are designed to elicit controlled affective states based on the arousal and valence dimensions. The Support Vector Machine is employed as a classifier to detect these affective states from both modalities. The recognition results reveal that eye tracking is superior to speech in affect detection and that the two modalities are complementary in the application of interactive gaming. This suggests that it is feasible to design an accurate multimodal recognition system to detect players’ affects from the eye tracking and speech modalities in the interactive gaming environment. We emphasise the potential of integrating the proposed multimodal system into game interfaces to enhance interaction and provide an adaptive gaming experience.
Publisher's Version Article Search


Demonstrations 1

Multimodal Interaction in Classrooms: Implementation of Tangibles in Integrated Music and Math Lessons
Jennifer Müller, Uwe Oestermeier, and Peter Gerjets
(University of Tübingen, Germany; Leibniz-Institut für Wissensmedien, Germany)
This demo presents the multidisciplinary development of the enrichment program “Listening to Math: Kids compose with LEGO”, which aims at fostering the math and music skills of gifted primary school students. In this program LEGO bricks are used as tangible representations of music notes. The so-called LEGO-Table, a tabletop computer, translates patterns made out of bricks into piano melodies. We will present the finished program, discuss the experience gained during the project, and give some insights into the implementation and evaluation of the program. Functions of the LEGO-Table will be demonstrated interactively on a tablet.
Publisher's Version Article Search Info
Web-Based Interactive Media Authoring System with Multimodal Interaction
Bok Deuk Song, Yeon Jun Choi, and Jong Hyun Park
(ETRI, South Korea)
Recently, interactive media that tries to maximize audience engagement has been widely distributed throughout the digital media environment. For example, the development and promotion of new participatory media genres with higher immersion, which can be applied to advertisements, films, games, and e-learning, is being actively researched. In this paper, we introduce a new interactive media authoring system that enables bidirectional communication between the user and the applications. This authoring system provides a modality interface which manages interactions between various modalities, i.e. which allows users to choose one or more modalities as inputs and to get other modalities as outputs. This system helps the user to engage with the media and minimizes the image producer-user constraints in various distribution-enabled web environments via PC and smart devices, without additional equipment installation.
Publisher's Version Article Search
Textured Surfaces for Ultrasound Haptic Displays
Euan Freeman, Ross Anderson, Julie Williamson, Graham Wilson, and Stephen A. Brewster
(University of Glasgow, UK)
We demonstrate a technique for rendering textured haptic surfaces in mid-air, using an ultrasound haptic display. Our technique renders tessellated 3D 'haptic' shapes with different waveform properties, creating surfaces with distinct perceptions.
Publisher's Version Article Search
Rapid Development of Multimodal Interactive Systems: A Demonstration of Platform for Situated Intelligence
Dan Bohus, Sean Andrist, and Mihai Jalobeanu
(Microsoft, USA; Microsoft Research, USA)
We demonstrate an open, extensible platform for developing and studying multimodal, integrative-AI systems. The platform provides a time-aware, stream-based programming model for parallel coordinated computation, a set of tools for data visualization, processing, and learning, and an ecosystem of pluggable AI components. The demonstration will showcase three applications built on this platform and highlight how the platform can significantly accelerate development and research in multimodal interactive systems.
Publisher's Version Article Search
MIRIAM: A Multimodal Chat-Based Interface for Autonomous Systems
Helen Hastie, Francisco Javier Chiyah Garcia, David A. Robb, Pedro Patron, and Atanas Laskov
(Heriot-Watt University, UK; SeeByte, UK)
We present MIRIAM (Multimodal Intelligent inteRactIon for Autonomous systeMs), a multimodal interface to support situation awareness of autonomous vehicles through chat-based interaction. The user is able to chat about the vehicle's plan, objectives, previous activities and mission progress. The system is mixed initiative in that it pro-actively sends messages about key events, such as fault warnings. We will demonstrate MIRIAM using SeeByte's SeeTrack command and control interface and Neptune autonomy simulator.
Publisher's Version Article Search Info
SAM: The School Attachment Monitor
Dong-Bach Vo, Mohammad Tayarani, Maki Rooksby, Rui Huan, Alessandro Vinciarelli, Helen Minnis, and Stephen A. Brewster
(University of Glasgow, UK)
Secure Attachment relationships have been shown to minimise social and behavioural problems in children and boost resilience to risks such as antisocial behaviour, heart pathologies, and suicide later in life. Attachment assessment is an expensive and time-consuming process that is not often performed. The School Attachment Monitor (SAM) automates Attachment assessment to support expert assessors. It uses doll-play activities with the dolls augmented with sensors and the child's play recorded with cameras to provide data for assessment. Social signal processing tools are then used to analyse the data and to automatically categorize Attachment patterns. This paper presents the current SAM interactive prototype.
Publisher's Version Article Search
The Boston Massacre History Experience
David Novick, Laura Rodriguez, Aaron Pacheco, Aaron Rodriguez, Laura Hinojos, Brad Cartwright, Marco Cardiel, Ivan Gris Sepulveda, Olivia Rodriguez-Herrera, and Enrique Ponce
(University of Texas at El Paso, USA; Black Portal Productions, USA)
The Boston Massacre History Experience is a multimodal interactive virtual-reality application aimed at eighth-grade students. In the application, students have conversations with seven embodied conversational agents, representing characters such as John Adams, a tea-shop owner, and a redcoat. The experience ends with a conversation with Abigail Adams for students to explain what they learned and a narration by John Adams about events that followed. The application is novel technically because it provides a form of user initiative in conversation and features an unprecedented number of virtual agents, some 486 in all.
Publisher's Version Article Search
Demonstrating TouchScope: A Hybrid Multitouch Oscilloscope Interface
Matthew Heinz, Sven Bertel, and Florian Echtler
(Bauhaus-Universität Weimar, Germany)
We present TouchScope, a hybrid multitouch interface for common off-the-shelf oscilloscopes. Oscilloscopes are a valuable tool for analyzing and debugging electronic circuits, but are also complex scientific instruments. Novices are faced with a seemingly overwhelming array of knobs and buttons, and usually require lengthy training before being able to use these devices productively. TouchScope uses a multitouch tablet in combination with an unmodified off-the-shelf oscilloscope to provide a novice-friendly hybrid interface, combining both the low entry barrier of a touch-based interface and the high degrees of freedom of a conventional button-based interface.
Publisher's Version Article Search
The MULTISIMO Multimodal Corpus of Collaborative Interactions
Maria Koutsombogera and Carl Vogel
(Trinity College, Ireland)
This paper describes a recently created multimodal corpus that has been designed to address multiparty interaction modelling, specifically collaborative aspects in task-based group interactions. A set of human-human interactions was collected with HD cameras, microphones and a Kinect sensor. The scenario involves two participants playing a game instructed and guided by a facilitator. In addition to the recordings, survey material was collected, including personality tests of the participants and experience assessment questionnaires. The corpus will be exploited for modelling behavioral aspects in collaborative group interaction by taking into account the speakers' multimodal signals and psychological variables.
Publisher's Version Article Search
Using Mobile Virtual Reality to Empower People with Hidden Disabilities to Overcome Their Barriers
Matthieu Poyade, Glyn Morris, Ian Taylor, and Victor Portela
(Glasgow School of Art, UK; Friendly Access, UK; Crag3D, UK)
This paper presents a proof of concept for an immersive and interactive mobile application which aims to help people with hidden disabilities to develop tolerance to the environmental stressors that are typically found in crowded public spaces, and more particularly in airports. The application initially invites the user to rehearse a series of sensory-attenuated experiences within digitally reconstructed environments of Aberdeen International Airport. Throughout rehearsals, environmental stressors are gradually increased, making the environments more sensorially challenging for the user. Usability and pilot testing provided encouraging outcomes ahead of future developments.
Publisher's Version Article Search

Demonstrations 2

Bot or Not: Exploring the Fine Line between Cyber and Human Identity
Mirjam Wester, Matthew P. Aylett, and David A. Braude
(CereProc, UK)
Speech technology is rapidly entering the everyday through the large scale commercial impact of systems such as Apple Siri and Amazon Echo. Meanwhile technology that allows voice cloning, voice modification, speech recognition, speech analytics and expressive speech synthesis has changed dramatically over recent years. The demonstration, described in this paper, is an educational tool in the form of an online quiz called 'Bot or Not'. Using the quiz we have gathered impressions of what people realise is possible with current speech synthesis technology. The opinions of various groups regarding the synthesis of famous voices, sounding like a robot, and the difference between synthesis and voice modification were collected.
Publisher's Version Article Search
Modulating the Non-verbal Social Signals of a Humanoid Robot
Amol Deshmukh, Bart Craenen, Alessandro Vinciarelli, and Mary Ellen Foster
(University of Glasgow, UK)
In this demonstration we present a repertoire of social signals generated by the humanoid robot Pepper in the context of the EU-funded project MuMMER. The aim of this research is to provide the robot with the expressive capabilities required to interact with people in real-world public spaces such as shopping malls. Being able to control the non-verbal behaviour of such a robot is key to engaging with humans in an effective way. We propose an approach to modulating the non-verbal social signals of the robot based on systematically varying the amplitude and speed of the joint motions and gathering user evaluations of the resulting gestures. We anticipate that the humans' perception of the robot behaviour will be influenced by these modulations.
Publisher's Version Article Search Video
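The amplitude and speed modulation described in the Pepper abstract above can be illustrated with a minimal sketch. It assumes a gesture is stored as a list of (time, joint angle) keyframes for a single joint; the function name `modulate_gesture` and its parameters are hypothetical and are not taken from the MuMMER project or from any robot SDK.

```python
def modulate_gesture(keyframes, amplitude=1.0, speed=1.0, rest=0.0):
    """Scale one joint's keyframe trajectory in amplitude and speed.

    keyframes: list of (time_s, angle_rad) pairs (assumed representation).
    amplitude: factor applied to each angle's offset from the rest pose.
    speed:     factor > 0; values above 1 play the gesture faster.
    rest:      rest angle around which the motion is scaled (assumed).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [
        (t / speed, rest + amplitude * (angle - rest))
        for t, angle in keyframes
    ]


if __name__ == "__main__":
    # A hypothetical one-second wave of a single joint.
    wave = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]
    # Half the amplitude, twice the speed.
    print(modulate_gesture(wave, amplitude=0.5, speed=2.0))
```

Scaling around a rest angle rather than around zero keeps the start and end pose of the gesture unchanged, which is one plausible way to make modulated variants comparable in a user evaluation.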
Thermal In-Car Interaction for Navigation
Patrizia Di Campli San Vito, Stephen A. Brewster, Frank Pollick, and Stuart White
(University of Glasgow, UK; Jaguar Land Rover, UK)
In this demonstration we show a thermal interaction design on the steering wheel for navigational cues in a car. Participants will be able to use a thermally enhanced steering wheel to follow instructions given in a turn-by-turn navigation task in a virtual city. The thermal cues will be provided on both sides of the steering wheel and will indicate the turning direction by warming the corresponding side, while the opposite side is being cooled.
Publisher's Version Article Search
AQUBE: An Interactive Music Reproduction System for Aquariums
Daisuke Sasaki, Musashi Nakajima, and Yoshihiro Kanno
(Waseda University, Japan; Tokyo Polytechnic University, Japan)
Aqube is an interactive music reproduction system that aims to enhance communication between aquarium visitors. We propose a musical experience in the aquarium that uses image processing to capture visual information about marine organisms and sound reproduction related to the state of the aquarium. Visitors set three colored cubes in front of the aquarium, and the system produces musical feedback when marine organisms swim over a cube. With Aqube, visitors will be able to take part collaboratively in producing the music of the exhibition space.
Publisher's Version Article Search
Real-Time Mixed-Reality Telepresence via 3D Reconstruction with HoloLens and Commodity Depth Sensors
Michal Joachimczak, Juan Liu, and Hiroshi Ando
(National Institute of Information and Communications Technology, Japan; Osaka University, Japan)
We present a demo of a low-cost mixed reality telepresence system that performs real-time 3D reconstruction of a person or an object and wirelessly transmits the reconstructions to Microsoft's HoloLens head mounted display at frame rates perceived as smooth. A reconstructed frame is represented as a polygonal mesh with polygons textured with high definition data obtained from RGB cameras. Each frame is compressed and sent to HoloLens, so that it can be locally rendered by its GPU, minimizing latency of reacting to head movement. Owing to the half-translucent displays of HoloLens, the system creates an appearance of the remote object being part of the user's physical environment. The system can run on relatively low-cost commodity hardware, such as Kinect sensors, and in the most basic scenario can produce smooth frame rates on a single multicore laptop.
Publisher's Version Article Search
Evaluating Robot Facial Expressions
Ruth Aylett, Frank Broz, Ayan Ghosh, Peter McKenna, Gnanathusharan Rajendran, Mary Ellen Foster, Giorgio Roffo, and Alessandro Vinciarelli
(Heriot-Watt University, UK; University of Glasgow, UK)
This paper outlines a demonstration of the work carried out in the SoCoRo project investigating how far a neuro-typical population recognises facial expressions on a non-naturalistic robot face that are designed to show approval and disapproval. RFID-tagged objects are presented to an Emys robot head (called Alyx) and Alyx reacts to each with a facial expression. Participants are asked to put the object in a box marked 'Like' or 'Dislike'. This study is being extended to include assessment of participants' Autism Quotient using a validated questionnaire as a step towards using a robot to help train high-functioning adults with an Autism Spectrum Disorder in social signal recognition.
Publisher's Version Article Search
Bimodal Feedback for In-Car Mid-Air Gesture Interaction
Gözel Shakeri, John H. Williamson, and Stephen A. Brewster
(University of Glasgow, UK)
This demonstration showcases novel multimodal feedback designs for in-car mid-air gesture interaction. It explores the potential of multimodal feedback types for mid-air gestures in cars and how these can reduce eyes-off-the-road time and thus make driving safer. We will show four different bimodal feedback combinations to provide effective information about interaction with systems in a car. These feedback techniques are visual-auditory, auditory-ambient (peripheral vision), ambient-tactile, and tactile-auditory. Users can interact with the system after a short introduction, creating an exciting opportunity to deploy these displays in cars in the future.
Publisher's Version Article Search
A Modular, Multimodal Open-Source Virtual Interviewer Dialog Agent
Kirby Cofino, Vikram Ramanarayanan, Patrick Lange, David Pautler, David Suendermann-Oeft, and Keelan Evanini
(American University, USA; ETS at San Francisco, USA; ETS at Princeton, USA)
We present an open-source multimodal dialog system equipped with a virtual human avatar interlocutor. The agent, rigged in Blender and developed in Unity with WebGL support, interfaces with the HALEF open-source cloud-based standard-compliant dialog framework. To demonstrate the capabilities of the system, we designed and implemented a conversational job interview scenario where the avatar plays the role of an interviewer and responds to user input in real-time to provide an immersive user experience.
Publisher's Version Article Search
Wearable Interactive Display for the Local Positioning System (LPS)
Daniel M. Lofaro, Christopher Taylor, Ryan Tse, and Donald Sofge
(US Naval Research Lab, USA; George Mason University, USA; Thomas Jefferson High School for Science and Technology, USA)
Traditionally, localization systems for formation-keeping have required external references to a world frame, such as GPS or motion capture. Humans wishing to maintain formations with each other use visual line-of-sight, which is not possible in cluttered environments that may block their view. Our Local Positioning System (LPS) is based on ultra-wideband ranging devices and it enables teams of robots and humans to know their relative locations without line-of-sight. This demonstration shows live (real-time) local positioning between three agents (human and robot). The emphasis of this demonstration is showing the wearable interactive display that allows the human to see the local frame formation of the human and the robots.
Publisher's Version Article Search
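As a rough illustration of how pairwise ranges alone can yield a local frame for three agents, as in the LPS demonstration above, the sketch below places agent A at the origin and agent B on the positive x-axis, then recovers agent C with the law of cosines. This is a textbook construction under idealized, noise-free ranges, not the authors' algorithm; the function name `local_frame` is hypothetical.

```python
import math


def local_frame(d_ab, d_ac, d_bc):
    """Return 2D coordinates of agents A, B, C from pairwise ranges.

    A is fixed at the origin and B on the positive x-axis (assumed
    convention). C is returned with y >= 0; ranges alone cannot
    distinguish this solution from its mirror image across the x-axis.
    """
    if d_ab <= 0:
        raise ValueError("d_ab must be positive")
    # Law of cosines: projection of C onto the A-B axis.
    x = (d_ac ** 2 + d_ab ** 2 - d_bc ** 2) / (2 * d_ab)
    # Clamp small negative values that noisy ranges could produce.
    y = math.sqrt(max(d_ac ** 2 - x ** 2, 0.0))
    return {"A": (0.0, 0.0), "B": (float(d_ab), 0.0), "C": (x, y)}


if __name__ == "__main__":
    # Hypothetical ranges in metres forming a 3-4-5 right triangle.
    print(local_frame(3.0, 4.0, 5.0))
```

A real system would also need to resolve the mirror ambiguity and filter range noise, for example by tracking positions over time.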

Grand Challenge

From Individual to Group-Level Emotion Recognition: EmotiW 5.0
Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon
(IIT Ropar, India; University of Canberra, Australia; University of Waterloo, Canada; Australian National University, Australia)
Research in automatic affect recognition has come a long way. This paper describes the fifth Emotion Recognition in the Wild (EmotiW) challenge 2017. EmotiW aims at providing a common benchmarking platform for researchers working on different aspects of affective computing. This year there are two sub-challenges: a) audio-video emotion recognition and b) group-level emotion recognition. These challenges are based on the Acted Facial Expressions in the Wild and Group Affect databases, respectively. The particular focus of the challenge is to evaluate methods in 'in the wild' settings. 'In the wild' here is used to describe the various environments represented in the images and videos, which represent real-world (not lab-like) scenarios. The baseline, data, protocol of the two challenges and the challenge participation are discussed in detail in this paper.
Publisher's Version Article Search Info
Multi-modal Emotion Recognition using Semi-supervised Learning and Multiple Neural Networks in the Wild
Dae Ha Kim, Min Kyu Lee, Dong Yoon Choi, and Byung Cheol Song
(Inha University, South Korea)
Human emotion recognition is a research topic that is receiving continuous attention in computer vision and artificial intelligence domains. This paper proposes a method for classifying human emotions through multiple neural networks based on multi-modal signals which consist of image, landmark, and audio in a wild environment. The proposed method has the following features. First, the learning performance of the image-based network is greatly improved by employing both multi-task learning and semi-supervised learning using the spatio-temporal characteristic of videos. Second, a model for converting one-dimensional (1D) facial landmark information into two-dimensional (2D) images is newly proposed, and a CNN-LSTM network based on the model is proposed for better emotion recognition. Third, based on an observation that audio signals are often very effective for specific emotions, we propose an audio deep learning mechanism robust to the specific emotions. Finally, so-called emotion adaptive fusion is applied to enable synergy of multiple networks. In the fifth attempt on the given test set in the EmotiW2017 challenge, the proposed method achieved a classification accuracy of 57.12%.
Publisher's Version Article Search
Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild
Stefano Pini, Olfa Ben Ahmed, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, and Benoit Huet
(University of Modena and Reggio Emilia, Italy; EURECOM, France)
In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second ones extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNN), while the third one consists of a pretrained audio network which is used to extract useful deep acoustic signals from video. In the audio branch, we also apply Long Short Term Memory (LSTM) networks in order to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities in one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve an accuracy of 50.39% and 49.92% respectively on the validation and the testing data.
Publisher's Version Article Search
Group-Level Emotion Recognition using Transfer Learning from Face Identification
Alexandr Rassadin, Alexey Gruzdev, and Andrey Savchenko
(National Research University Higher School of Economics, Russia)
In this paper, we describe our algorithmic approach, which was used for submissions in the fifth Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition sub-challenge. We extracted feature vectors of detected faces using a Convolutional Neural Network trained for the face identification task, rather than traditional pre-training on emotion recognition problems. In the final pipeline an ensemble of Random Forest classifiers was learned to predict the emotion score using the available training set. When no faces are detected, one member of our ensemble extracts features from the whole image. During our experimental study, the proposed approach showed the lowest error rate when compared to other explored techniques. In particular, we achieved 75.4% accuracy on the validation data, which is 20% higher than the handcrafted feature-based baseline. The source code, which uses the Keras framework, is publicly available.
Publisher's Version Article Search
Group Emotion Recognition with Individual Facial Emotion CNNs and Global Image Based CNNs
Lianzhi Tan, Kaipeng Zhang, Kai Wang, Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao
(SIAT at Chinese Academy of Sciences, China; National Taiwan University, Taiwan)

This paper presents our approach for group-level emotion recognition in the Emotion Recognition in the Wild Challenge 2017. The task is to classify an image into one of the group emotion categories (positive, neutral or negative). Our approach is based on two types of Convolutional Neural Networks (CNNs), namely individual facial emotion CNNs and global image based CNNs. For the individual facial emotion CNNs, we first extract all the faces in an image, and assign the image label to all faces for training. In particular, we utilize a large-margin softmax loss for discriminative learning and we train two CNNs on both aligned and non-aligned faces. For the global image based CNNs, we compare several recent state-of-the-art network structures and data augmentation strategies to boost performance. For a test image, we average the scores from all faces and the image to predict the final group emotion category. We win the challenge with accuracies of 83.9% and 80.9% on the validation set and testing set respectively, improving on the baseline results by about 30%.

Publisher's Version Article Search
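The test-time fusion step in the abstract above (averaging the scores from all faces and the image) can be sketched as follows. The sketch assumes each face-level model and the image-level model emit one score per class in the order positive, neutral, negative, and that every face and the whole image receive equal weight; the abstract does not specify the weighting, and `fuse_scores` is a hypothetical name.

```python
LABELS = ("positive", "neutral", "negative")


def fuse_scores(face_scores, image_scores):
    """Average per-class scores over all faces and the whole image.

    face_scores:  list of per-face score lists, one score per label.
    image_scores: score list from the global image model.
    Returns the predicted label and the averaged scores.
    Equal weighting of every face and the image is an assumption.
    """
    all_scores = list(face_scores) + [image_scores]
    n = len(all_scores)
    mean = [sum(s[i] for s in all_scores) / n for i in range(len(LABELS))]
    best = max(range(len(LABELS)), key=mean.__getitem__)
    return LABELS[best], mean


if __name__ == "__main__":
    # Hypothetical scores for two detected faces and the whole image.
    faces = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
    image = [0.2, 0.5, 0.3]
    print(fuse_scores(faces, image))
```

When no faces are detected, the list of face scores is empty and the prediction falls back to the image-level scores alone.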
Learning Supervised Scoring Ensemble for Emotion Recognition in the Wild
Ping Hu, Dongqi Cai, Shandong Wang, Anbang Yao, and Yurong Chen
(Intel Labs, China)
State-of-the-art approaches for the previous emotion recognition in the wild challenges are usually built on prevailing Convolutional Neural Networks (CNNs). Although there is clear evidence that CNNs with increased depth or width can usually bring improved prediction accuracy, existing top approaches provide supervision only at the output feature layer, resulting in the insufficient training of deep CNN models. In this paper, we present a new learning method named Supervised Scoring Ensemble (SSE) for advancing this challenge with deep CNNs. We first extend the idea of recent deep supervision to deal with the emotion recognition problem. Benefiting from adding supervision not only to deep layers but also to intermediate layers and shallow layers, the training of deep CNNs can be well eased. Second, we present a new fusion structure in which class-wise scoring activations at diverse complementary feature layers are concatenated and further used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. We show our proposed learning method brings large accuracy gains over diverse backbone networks consistently. On this year's audio-video based emotion recognition task, the average recognition rate of our best submission is 60.34%, forming a new envelope over all existing records.
Publisher's Version Article Search
Group Emotion Recognition in the Wild by Combining Deep Neural Networks for Facial Expression Classification and Scene-Context Analysis
Asad Abbas and Stephan K. Chalup
(University of Newcastle, Australia)
This paper presents the implementation details of a proposed solution to the Emotion Recognition in the Wild 2017 Challenge, in the category of group-level emotion recognition. The objective of this sub-challenge is to classify a group's emotion as Positive, Neutral or Negative. Our proposed approach incorporates both image context and facial information extracted from an image for classification. We use Convolutional Neural Networks (CNNs) to predict facial emotions from detected faces present in an image. Predicted facial emotions are combined with scene-context information extracted by another CNN using fully connected neural network layers. Various techniques are explored by combining and training these two Deep Neural Network models in order to perform group-level emotion recognition. We evaluate our approach on the Group Affective Database 2.0 provided with the challenge. Experimental evaluations show promising performance improvements, resulting in approximately 37% improvement over the competition's baseline model on the validation dataset.
Publisher's Version Article Search
Temporal Multimodal Fusion for Video Emotion Classification in the Wild
Valentin Vielzeuf, Stéphane Pateux, and Frédéric Jurie
(Orange Labs, France; Normandy University, France; CNRS, France)

This paper addresses the question of emotion classification. The task consists of predicting the emotion labels (taken from a set of possible labels) that best describe the emotions contained in short video clips. Building on a standard framework – which describes videos by audio and visual features used by a supervised classifier to infer the labels – this paper investigates several novel directions. First of all, improved face descriptors based on 2D and 3D Convolutional Neural Networks are proposed. Second, the paper explores several fusion methods, temporal and multimodal, including a novel hierarchical method combining features and scores. In addition, we carefully reviewed the different stages of the pipeline and designed a CNN architecture adapted to the task; this is important as the size of the training set is small compared to the difficulty of the problem, making generalization difficult. The resulting model ranked 4th at the 2017 Emotion in the Wild challenge with an accuracy of 58.8%.

Publisher's Version Article Search
Audio-Visual Emotion Recognition using Deep Transfer Learning and Multiple Temporal Models
Xi Ouyang, Shigenori Kawaai, Ester Gue Hua Goh, Shengmei Shen, Wan Ding, Huaiping Ming, and Dong-Yan Huang
(Panasonic R&D Center, Singapore; Central China Normal University, China; Institute for Infocomm Research at A*STAR, Singapore)
This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear and disgust) and neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenges of emotion recognition in the wild. Deep network transfer learning is used for feature extraction. Spatial-temporal model fusion makes full use of the complementarity of different networks. Semi-auto reinforcement learning optimizes the fusion strategy based on dynamic external feedback given by the challenge organizers. The overall accuracy of the proposed approach on the challenge test dataset is 57.2%, which is better than the challenge baseline of 40.47%.
Publisher's Version Article Search
Multi-Level Feature Fusion for Group-Level Emotion Recognition
B. Balaji and V. Ramana Murthy Oruganti
(Amrita University at Coimbatore, India)
In this paper, the influence of low-level and mid-level features is investigated for image-based group emotion recognition. We hypothesize that the human faces and the objects surrounding them are major sources of information and thus can serve as mid-level features. Hence, we detect faces and objects using pre-trained Deep Net models. Information from different layers in conjunction with different encoding techniques is extensively investigated to obtain the richest feature vectors. The best result, a classification accuracy of 65.0% on the validation set, was submitted to the Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition sub-challenge. The best feature vector yielded 75.1% on the testing set. Post competition, a few more experiments were performed and included the same
Publisher's Version Article Search
A New Deep-Learning Framework for Group Emotion Recognition
Qinglan Wei, Yijia Zhao, Qihua Xu, Liandong Li, Jun He, Lejun Yu, and Bo Sun
(Beijing Normal University, China)
In this paper, we target the Group-level emotion recognition sub-challenge of the fifth Emotion Recognition in the Wild (EmotiW 2017) Challenge, which is based on the Group Affect Database 2.0 containing images of groups of people in a wide variety of social events. We use Seetaface to detect and align the faces in the group images and extract two kinds of face-image visual features: VGGFace-lstm and DCNN-lstm. As group image features, we propose using Pyramid Histogram of Oriented Gradients (PHOG), CENTRIST, DCNN features, and VGG features. For testing group images in which faces have been detected, the final emotion is estimated using group image features and face-level visual features. For testing group images in which faces cannot be detected, the face-level visual features are fused for final recognition. We achieve 79.78% accuracy on the Group Affect Database 2.0 testing set, which is much higher than the corresponding baseline result of 53.62%.
Publisher's Version Article Search
Emotion Recognition in the Wild using Deep Neural Networks and Bayesian Classifiers
Luca Surace, Massimiliano Patacchiola, Elena Battini Sönmez, William Spataro, and Angelo Cangelosi
(University of Calabria, Italy; Plymouth University, UK; Istanbul Bilgi University, Turkey)
Group emotion recognition in the wild is a challenging problem, due to the unstructured environments in which everyday life pictures are taken. Some of the obstacles for an effective classification are occlusions, variable lighting conditions, and image quality. In this work we present a solution based on a novel combination of deep neural networks and Bayesian classifiers. The neural network works on a bottom-up approach, analyzing emotions expressed by isolated faces. The Bayesian classifier estimates a global emotion integrating top-down features obtained through a scene descriptor. In order to validate the system we tested the framework on the dataset released for the Emotion Recognition in the Wild Challenge 2017. Our method achieved an accuracy of 64.68% on the test set, significantly outperforming the 53.62% competition baseline.
Publisher's Version Article Search
Emotion Recognition with Multimodal Features and Temporal Models
Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin, Shilei Zhang, and Yong Qin
(Renmin University of China, China; IBM Research Lab, China)

This paper presents our methods for the Audio-Video Based Emotion Recognition subtask in the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task aims to predict one of the seven basic emotions for short video segments. We extract different features from audio and facial expression modalities. We also explore the temporal LSTM model with the input of frame facial features, which improves the performance of the non-temporal model. The fusion of different modality features and the temporal model leads us to achieve a 58.5% accuracy on the testing set, which shows the effectiveness of our methods.

Publisher's Version Article Search
Group-Level Emotion Recognition using Deep Models on Image Scene, Faces, and Skeletons
Xin Guo, Luisa F. Polanía, and Kenneth E. Barner
(University of Delaware, USA; American Family Mutual Insurance Company, USA)
This paper presents the work submitted to the Group-level Emotion Recognition sub-challenge, which is part of the 5th Emotion Recognition in the Wild (EmotiW 2017) Challenge. The task of this sub-challenge is to classify the emotion of a group of people in each image as positive, neutral or negative. To address this task, a hybrid network that incorporates global scene features, skeleton features of the group, and local facial features is developed. Specifically, deep convolutional neural networks (CNNs) are first trained on the faces of the group, the whole images and the skeletons of the group, and then fused to perform the group-level emotion prediction. Experimental results show that the proposed network achieves 80.05% and 80.61% on the validation and testing sets, respectively, outperforming the baseline of 52.97% and 53.62%.
Publisher's Version Article Search

Doctoral Consortium

Towards Designing Speech Technology Based Assistive Interfaces for Children's Speech Therapy
Revathy Nayar
(University of Strathclyde, UK)
Independent studies by the UK government in 2016 observed that speech sound disorders are the most prevalent condition amongst children. Lack of opportunities for individual practice and increasing caseload on speech therapists affect therapy outcomes. Assistive technological aids that process children’s speech and provide appropriate feedback can support therapy efficiency. Design of such aids is particularly demanding because they require speech recognition at phoneme level with high fidelity coupled with child friendly interfaces. Through this doctoral research we aim to design a mobile application using speech and touch as input modalities and thereby explore the technical challenges in adopting such tools in therapy. The design approach of this study is informed by empirical field work on user requirements from children’s speech therapists. A stable, non-deviant phoneme recognition accuracy combined with the modeling of progress assessment used in actual clinical practices can feed into the design of a feedback based mobile game to be used to practice therapy exercises.
Publisher's Version Article Search
Social Robots for Motivation and Engagement in Therapy
Katie Winkle
(Bristol Robotics Laboratory, UK)
This extended abstract gives an overview of current doctoral research being undertaken on human robot interaction (HRI) for social robots to be used as motivation and engagement tools in rehabilitative therapies, e.g. physiotherapy, occupational therapy and speech therapy. Specifically the work aims to design the AI, social behaviours and interaction modalities for such a robot, validated through a series of human human interaction (HHI) and HRI studies.
Publisher's Version Article Search
Immersive Virtual Eating and Conditioned Food Responses
Nikita Mae B. Tuanquin
(University of Canterbury, New Zealand)
The objective of this research is to explore the use of Virtual Reality (VR) on food cravings and to introduce a new technique for eliciting eating behavioral changes with the use of VR. The research comprises two user studies: one exploring the effects of olfactory cues and active participation in VR on food cravings, and one on the implementation of the proposed “Redirected Eating” (RE) technique. RE is a method of superimposing virtual visual and olfactory properties onto tracked physical food in VR. These virtual properties can be manipulated to make the users believe that the physical food that they are eating has different properties. These manipulations are then intended to gradually change the food-related behavior of the user towards healthier food choices.
Publisher's Version Article Search
Towards Edible Interfaces: Designing Interactions with Food
Tom Gayler
(Lancaster University, UK)
Food provides humans with some of the most universal and rich sensory experiences possible. For a long time technology was unable to recreate such experiences but now new innovations are changing that. Using the novel manufacturing technology of 3D printed food, I am developing ‘Edible Interfaces’. My research uses a user-centered research approach to focus on food as material for interactive experience in HCI. This will lead to development of Edible Interfaces that are built on the understanding and application of the experiential affordances of food. Designing with food allows the creation of forms of experience not possible through traditional interfaces. My studies so far have explored the perceptions of 3D printed food and potentials for food to advance affective computing. This knowledge is broadening on-going work in the field of multi-sensory HCI and delivering a new perspective on how we design for experience.
Towards a Computational Model for First Impressions Generation
Beatrice Biancardi
(CNRS, France; UPMC, France)
This paper presents a plan towards a computational model of first impressions generation and its integration in an embodied conversational agent (ECA). The goal is to endow an ECA with the ability to manage the impressions elicited in the user by adapting its behaviour in order to impact the interaction. Our approach starts from studying first impressions mechanisms in human-human interaction, with the goal of investigating whether the same processes occur in the interaction with a virtual agent. We present results obtained from the analysis of a corpus of natural human-human interactions, and our future steps intended to build the impression generation and the adaptation modules for the computational model of the agent.
A Decentralised Multimodal Integration of Social Signals: A Bio-Inspired Approach
Esma Mansouri-Benssassi
(University of St. Andrews, UK)
The ability to integrate information from different sensory modalities in a social context is crucial for achieving an understanding of social cues and gaining useful social interaction and experience. Recent research has focused on multi-modal integration of social signals from visual, auditory, haptic or physiological data. Different data fusion techniques have been designed and developed; however, the majority have not achieved a significant accuracy improvement in recognising social cues compared to uni-modal social signal recognition. One possible limitation is that these existing approaches lack sufficient capacity to model the various types of interactions between different modalities and have not been able to leverage the advantages of multi-modal signals by considering each of them as complementary to the others. This paper introduces ideas and plans for creating a decentralised model for social signal integration inspired by computational models of multi-sensory integration in neuroscience.
Human-Centered Recognition of Children's Touchscreen Gestures
Alex Shaw
(University of Florida, USA)
Touchscreen gestures are an important method of interaction for both children and adults. Automated recognition algorithms are able to recognize adults’ gestures quite well, but recognition rates for children are much lower. My PhD thesis focuses on analyzing children’s touchscreen gestures and using the information gained to develop new, child-centered recognition approaches that can recognize children’s gestures with higher accuracy than existing algorithms. This paper describes past and ongoing work toward this end and outlines the next steps in my PhD work.
Cross-Modality Interaction between EEG Signals and Facial Expression
Soheil Rayatdoost
(University of Geneva, Switzerland)
Learning a meaningful representation of cross-modality embeddings is at the heart of multimodal emotion recognition. Previous studies report that electroencephalogram (EEG) signals recorded from the scalp contain electromyogram (EMG) artifacts. Even though the EMG signals from facial expressions are valuable for emotion recognition, they are considered unwanted signals for brain-computer interfaces. In this work, we aim to develop a model to represent and separate the cross-modality interference between facial EMG and EEG signals. We designed a novel protocol and collected signals during acted and spontaneous expressions and action units to identify their artifacts in all the possible combinations. In the future, we will develop a model to predict facial expression artifacts in EEG signals. Finally, we will develop a hybrid emotion recognition system that will be able to consider or remove the interference between EEG signals and facial expressions. The proposed techniques will be validated on existing databases and on our novel database.
Hybrid Models for Opinion Analysis in Speech Interactions
Valentin Barriere
(Telecom ParisTech, France; University of Paris-Saclay, France)
Sentiment analysis is a trendy domain of Machine Learning which has developed considerably in the last several years. Nevertheless, most sentiment analysis systems are general. They do not take advantage of the interactional context and all of the possibilities that it brings. My PhD thesis focuses on creating a system that uses the information transmitted between two speakers in order to analyze opinion within a human-human or a human-agent interaction. This paper outlines a research plan for investigating a system that analyzes opinion in speech interactions, using hybrid discriminative models. We present the state of the art in our domain, then we discuss our prior research in the area and the preliminary results we obtained. Finally, we conclude with the future perspectives we want to explore during the rest of this PhD work.
Evaluating Engagement in Digital Narratives from Facial Data
Rui Huan
(University of Glasgow, UK)
Engagement researchers indicate that the engagement level of people in a narrative has an influence on their subsequent story-related attitudes and beliefs, which helps psychologists understand people's social behaviours and personal experience. With the arrival of multimedia, the digital narrative combines multimedia features (e.g. varying images, music and voiceover) with traditional storytelling. Research on digital narratives has been widely used in helping students gain problem-solving and presentation skills as well as in supporting child psychologists investigating children's social understanding, such as family/peer relationships, through the completion of their digital narratives. However, there has been little study of the effect of multimedia features in digital narratives on people's engagement level. This research focuses on measuring people's levels of engagement in digital narratives and specifically on understanding the media effect of digital narratives on those engagement levels. Measurement tools are developed and validated through analyses of facial data from different age groups (children and young adults) while watching stories with different media features of digital narratives. Data sources used in this research include a questionnaire with a Smileyometer scale and the observation of each participant's facial behaviours.
Social Signal Extraction from Egocentric Photo-Streams
Maedeh Aghaei
(University of Barcelona, Spain)
This paper proposes a system for automatic social pattern characterization using a wearable photo-camera. The proposed pipeline consists of three major steps. The first is detection of the people with whom the camera wearer interacts, and the second is categorization of the detected social interactions as formal or informal. These two steps act at the event level, where each potential social event is modeled as a multi-dimensional time series whose dimensions correspond to a set of relevant features for each task, and an LSTM network is employed for time-series classification. In the last step, recurrences of the same person across the whole set of social interactions are clustered to achieve a comprehensive understanding of the diversity and frequency of the user's social relations. Experiments on a dataset acquired by a user wearing a photo-camera for a month show promising results on the task of social pattern characterization from egocentric photo-streams.
Multimodal Language Grounding for Improved Human-Robot Collaboration: Exploring Spatial Semantic Representations in the Shared Space of Attention
Dimosthenis Kontogiorgos
(KTH, Sweden)
There is an increased interest in artificially intelligent technology that surrounds us and takes decisions on our behalf. This creates the need for such technology to be able to communicate with humans and understand natural language and non-verbal behaviour that may carry information about our complex physical world. Artificial agents today still have little knowledge about the physical space that surrounds us and about the objects or concepts within our attention. We still lack computational methods for understanding the context of human conversation that involves objects and locations around us. Can we use multimodal cues from human perception of the real world as an example of language learning for robots? Can artificial agents and robots learn about the physical world by observing how humans interact with it, how they refer to it, and what they attend to during their conversations? This PhD project focuses on combining spoken language and non-verbal behaviour extracted from multi-party dialogue in order to increase context awareness and spatial understanding for artificial agents.

Workshop Summaries

ISIAA 2017: 1st International Workshop on Investigating Social Interactions with Artificial Agents (Workshop Summary)
Thierry Chaminade, Fabrice Lefèvre, Noël Nguyen, and Magalie Ochs
(Aix-Marseille University, France; CNRS, France; University of Avignon, France)
The workshop “Investigating Social Interactions With Artificial Agents”, organized within the “International Conference on Multimodal Interaction 2017”, attempts to bring together researchers from different fields sharing a similar interest in human interactions with other agents. If interdisciplinarity is necessary to address the question of the “Turing Test”, namely “can an artificial conversational agent be perceived as human”, it is also a very promising new way to investigate social interactions in the first place. Biology is represented by social cognitive neuroscience, aiming to describe the physiology of human social behaviors. Linguistics, from the humanities, attempts to characterize a specifically human behavior and language. Social Signal Processing is a recent approach that uses advanced Information Technologies to automatically analyze the behaviors pertaining to natural human interactions. Finally, from Artificial Intelligence, the development of artificial agents, conversational and/or embodied, onscreen or physical, attempts to recreate non-human socially interactive agents for a multitude of applications.
WOCCI 2017: 6th International Workshop on Child Computer Interaction (Workshop Summary)
Keelan Evanini, Maryam Najafian, Saeid Safavi, and Kay Berkling
(ETS at Princeton, USA; Massachusetts Institute of Technology, USA; University of Hertfordshire, UK; DHBW Karlsruhe, Germany)
The 6th International Workshop on Child Computer Interaction (WOCCI) was held in conjunction with ICMI 2017 in Glasgow, Scotland, on November 13, 2017. The workshop included ten papers spanning a range of topics relevant to child computer interaction, including speech therapy, reading tutoring, interaction with robots, and storytelling using figurines, among others. In addition, an invited talk entitled “Automatic Recognition of Children's Speech for Child-Computer Interaction” was given by Martin Russell, Professor of Information Engineering at the University of Birmingham.
MIE 2017: 1st International Workshop on Multimodal Interaction for Education (Workshop Summary)
Gualtiero Volpe, Monica Gori, Nadia Bianchi-Berthouze, Gabriel Baud-Bovy, Paolo Alborno, and Erica Volta
(University of Genoa, Italy; IIT Genoa, Italy; University College London, UK)
The International Workshop on Multimodal Interaction for Education aims at investigating how multimodal interactive systems, firmly grounded on psychophysical, psychological, and pedagogical bases, can be designed, developed, and exploited for enhancing teaching and learning processes in different learning environments, with a special focus on children in the classroom. Whilst the usage of multisensory technologies in the education area is rapidly expanding, the need for solid scientific bases, design guidelines, and appropriate procedures for evaluation is emerging. Moreover, the introduction of multimodal interactive systems in the learning environment requires suitable pedagogical paradigms to be developed at the same time. This workshop aims at bringing together researchers and practitioners from different disciplines, including pedagogy, psychology, psychophysics, and computer science (with a particular focus on human-computer interaction, affective computing, and social signal processing), to discuss such challenges from a multidisciplinary perspective. The workshop is partially supported by the EU-H2020-ICT Project weDRAW (http://www.wedraw.eu).
Playlab: Telling Stories with Technology (Workshop Summary)
Julie Williamson, Tom Flint, and Chris Speed
(University of Glasgow, UK; Edinburgh Napier University, UK; Edinburgh College of Art, UK)
This one-day workshop explores how playful interaction can be used to develop technologies for public spaces and create temporal experiences.
MHFI 2017: 2nd International Workshop on Multisensorial Approaches to Human-Food Interaction (Workshop Summary)
Carlos Velasco, Anton Nijholt, Marianna Obrist, Katsunori Okajima, Rick Schifferstein, and Charles Spence
(BI Norwegian Business School, Norway; University of Twente, Netherlands; University of Sussex, UK; Yokohama National University, Japan; TU Delft, Netherlands; University of Oxford, UK)
We present an introductory paper for the second edition of the workshop on ‘Multisensory Approaches to Human-Food Interaction’ to be held at the 19th ACM International Conference on Multimodal Interaction, which will take place on November 13th, 2017, in Glasgow, Scotland. Here, we describe the workshop’s objectives, the key contributions of the different position papers, and the relevance of their respective topic(s). Both multisensory research and new technologies are evolving fast, which has opened up a number of possibilities for designing new ways of interacting with foods and drinks. This workshop highlights the rapidly growing interest in the field of Multisensory Human-Food Interaction, which can, for example, be observed in the variety of novel research and technology developments in the area.
