19th ACM International Conference on Multimodal Interaction (ICMI 2017),
November 13–17, 2017,
Glasgow, UK
Main Track
Oral Session 1: Children and Interaction
Tablets, Tabletops, and Smartphones: Cross-Platform Comparisons of Children’s Touchscreen Interactions
Julia Woodward, Alex Shaw, Aishat Aloba, Ayushi Jain, Jaime Ruiz, and Lisa Anthony
(University of Florida, USA)
The proliferation of smartphones and tablets has increased children’s access to and usage of touchscreen devices. Prior work on smartphones has shown that children’s touch interactions differ from adults’. However, larger screen devices like tablets and tabletops have not been studied at the same granularity for children as smaller devices. We present two studies: one of 13 children using tablets with pen and touch, and one of 18 children using a touchscreen tabletop device. Participants completed target touching and gesture drawing tasks. We found significant differences in performance by modality on the tablet: children responded faster and slipped less with touch than with pen. In the tabletop study, children responded more accurately to changing target locations (fewer holdovers) and were more accurate when touching targets around the screen. Gesture recognition rates were consistent across devices. We provide design guidelines for children’s touchscreen interactions across screen sizes to inform the design of future touchscreen applications for children.
@InProceedings{ICMI17p5,
author = {Julia Woodward and Alex Shaw and Aishat Aloba and Ayushi Jain and Jaime Ruiz and Lisa Anthony},
title = {Tablets, Tabletops, and Smartphones: Cross-Platform Comparisons of Children’s Touchscreen Interactions},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {5--14},
doi = {},
year = {2017},
}
Toward an Efficient Body Expression Recognition Based on the Synthesis of a Neutral Movement
Arthur Crenn, Alexandre Meyer, Rizwan Ahmed Khan, Hubert Konik, and Saida Bouakaz
(University of Lyon, France; University of Saint-Etienne, France)
We present a novel framework for the recognition of body expressions using human postures. The proposed system is based on analyzing the spectral difference between an expressive and a neutral animation. A second problem addressed in this paper is the formalization of neutral animation, which has not been tackled before and can be very useful for domains such as animation synthesis and expression recognition. We propose a cost function to synthesize a neutral motion from an expressive motion; it formalizes a neutral motion by computing the distance and combining it with the acceleration of each body joint during a motion. We have evaluated our approach on several databases with heterogeneous movements and body expressions. Our body expression recognition results exceed the state of the art on the evaluated databases.
@InProceedings{ICMI17p15,
author = {Arthur Crenn and Alexandre Meyer and Rizwan Ahmed Khan and Hubert Konik and Saida Bouakaz},
title = {Toward an Efficient Body Expression Recognition Based on the Synthesis of a Neutral Movement},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {15--22},
doi = {},
year = {2017},
}
Interactive Narration with a Child: Impact of Prosody and Facial Expressions
Ovidiu Șerban, Mukesh Barange, Sahba Zojaji, Alexandre Pauchet, Adeline Richard, and Emilie Chanoni
(Normandy University, France; University of Rouen, France)
Intelligent Virtual Agents are suitable means for interactive storytelling for children. The engagement level of child interaction with virtual agents is a challenging issue in this area. However, the characteristics of child-agent interaction have received moderate to little attention in scientific studies, whereas such knowledge may be crucial for designing specific applications.
This article proposes a Wizard of Oz platform for interactive narration. An experimental study in the context of interactive storytelling exploiting this platform is presented to evaluate the impact of agent prosody and facial expressions on child participation during storytelling. The results show that the use of the virtual agent with prosody and facial expression modalities improves the engagement of children in interaction during the narrative sessions.
@InProceedings{ICMI17p23,
author = {Ovidiu Șerban and Mukesh Barange and Sahba Zojaji and Alexandre Pauchet and Adeline Richard and Emilie Chanoni},
title = {Interactive Narration with a Child: Impact of Prosody and Facial Expressions},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {23--31},
doi = {},
year = {2017},
}
Comparing Human and Machine Recognition of Children’s Touchscreen Stroke Gestures
Alex Shaw, Jaime Ruiz, and Lisa Anthony
(University of Florida, USA)
Children's touchscreen stroke gestures are poorly recognized by existing recognition algorithms, especially compared to adults' gestures. It seems clear that improved recognition is necessary, but how much is realistic? Human recognition rates may be a good starting point, but no prior work exists establishing an empirical threshold for a target accuracy in recognizing children's gestures based on human recognition. To this end, we present a crowdsourcing study in which naïve adult viewers recruited via Amazon Mechanical Turk were asked to classify gestures produced by 5- to 10-year-old children. We found a significant difference between human (90.60%) and machine (84.14%) recognition accuracy, over all ages. We also found significant differences between human and machine recognition of gestures of different types: humans perform much better than machines do on letters and numbers versus symbols and shapes. We provide an empirical measure of the accuracy that future machine recognition should aim for, as well as a guide for which categories of gestures have the most room for improvement in automated recognition. Our findings will inform future work on recognition of children's gestures and improving applications for children.
@InProceedings{ICMI17p32,
author = {Alex Shaw and Jaime Ruiz and Lisa Anthony},
title = {Comparing Human and Machine Recognition of Children’s Touchscreen Stroke Gestures},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {32--40},
doi = {},
year = {2017},
}
Oral Session 2: Understanding Human Behaviour
Virtual Debate Coach Design: Assessing Multimodal Argumentation Performance
Volha Petukhova, Tobias Mayer, Andrei Malchanau, and Harry Bunt
(Saarland University, Germany; Tilburg University, Netherlands)
This paper discusses the design and evaluation of a coaching system used to train young politicians to apply appropriate multimodal rhetoric devices to improve their debate skills. The presented study is carried out to develop debate performance assessment methods and interaction models underlying a Virtual Debate Coach (VDC) application. We identify a number of criteria associated with three questions: (1) how convincing is a debater's argumentation; (2) how well are debate arguments structured; and (3) how well is an argument delivered. We collected and analysed multimodal data of trainees' debate behaviour, and contrasted it with that of skilled professional debaters. Observational, correlation and machine learning experiments were performed to identify multimodal correlates of convincing debate performance and link them to experts' assessments. A rich set of prosodic, motion, linguistic and structural features was considered for the system to operate on. The VDC system was positively evaluated in a trainee-based setting.
@InProceedings{ICMI17p41,
author = {Volha Petukhova and Tobias Mayer and Andrei Malchanau and Harry Bunt},
title = {Virtual Debate Coach Design: Assessing Multimodal Argumentation Performance},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {41--50},
doi = {},
year = {2017},
}
Predicting the Distribution of Emotion Perception: Capturing Inter-rater Variability
Biqiao Zhang, Georg Essl, and Emily Mower Provost
(University of Michigan, USA; University of Wisconsin-Milwaukee, USA)
Emotion perception is person-dependent and variable. Dimensional characterizations of emotion can capture this variability by describing emotion in terms of its properties (e.g., valence, positive vs. negative, and activation, calm vs. excited). However, in many emotion recognition systems, this variability is often considered "noise" and is attenuated by averaging across raters. Yet, inter-rater variability provides information about the subtlety or clarity of an emotional expression and can be used to describe complex emotions. In this paper, we investigate methods that can effectively capture the variability across evaluators by predicting emotion perception as a discrete probability distribution in the valence-activation space. We propose: (1) a label processing method that can generate two-dimensional discrete probability distributions of emotion from a limited number of ordinal labels; (2) a new approach that predicts the generated probabilistic distributions using dynamic audio-visual features and Convolutional Neural Networks (CNNs). Our experimental results on the MSP-IMPROV corpus suggest that the proposed approach is more effective than the conventional Support Vector Regressions (SVRs) approach with utterance-level statistical features, and that feature-level fusion of the audio and video modalities outperforms decision-level fusion. The proposed CNN model predominantly improves the prediction accuracy for the valence dimension and brings a consistent performance improvement over data recorded from natural interactions. The results demonstrate the effectiveness of generating emotion distributions from a limited number of labels and predicting the distribution using dynamic features and neural networks.
@InProceedings{ICMI17p51,
author = {Biqiao Zhang and Georg Essl and Emily Mower Provost},
title = {Predicting the Distribution of Emotion Perception: Capturing Inter-rater Variability},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {51--59},
doi = {},
year = {2017},
}
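To make the label-processing idea above concrete, the sketch below shows one simple way to turn a handful of ordinal (valence, activation) ratings into a two-dimensional discrete probability distribution. It is an illustrative Python example with assumed bin counts and additive smoothing, not the authors' actual label-processing method.
import numpy as np

def label_distribution(valence_labels, activation_labels, n_bins=5, smoothing=1.0):
    """Turn per-rater ordinal (valence, activation) labels into a discrete
    2D probability distribution with additive (Laplace) smoothing."""
    hist = np.full((n_bins, n_bins), smoothing)   # smoothed counts
    for v, a in zip(valence_labels, activation_labels):
        hist[v - 1, a - 1] += 1.0                 # ordinal labels assumed in 1..n_bins
    return hist / hist.sum()                      # normalise to a probability distribution

# toy example: five raters, valence/activation on a 1-5 ordinal scale
p = label_distribution([3, 4, 3, 2, 4], [4, 4, 5, 3, 4])
print(p.shape, p.sum())   # (5, 5) and (approximately) 1.0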
Automatically Predicting Human Knowledgeability through Non-verbal Cues
Abdelwahab Bourai, Tadas Baltrušaitis, and Louis-Philippe Morency
(Carnegie Mellon University, USA)
Humans possess an incredible ability to transmit and decode "metainformation" through non-verbal actions in daily communication with each other. One communicative phenomenon that is transmitted through these subtle cues is knowledgeability. In this work, we conduct two experiments. First, we analyze which non-verbal features are important for identifying knowledgeable people when responding to a question. Next, we train a model to predict the knowledgeability of speakers in a game show setting. We achieve results that surpass chance and human performance using a multimodal approach fusing prosodic and visual features. We believe computer systems that can incorporate emotional reasoning at this level can greatly improve human-computer communication and interaction.
@InProceedings{ICMI17p60,
author = {Abdelwahab Bourai and Tadas Baltrušaitis and Louis-Philippe Morency},
title = {Automatically Predicting Human Knowledgeability through Non-verbal Cues},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {60--67},
doi = {},
year = {2017},
}
Pooling Acoustic and Lexical Features for the Prediction of Valence
Zakaria Aldeneh, Soheil Khorram, Dimitrios Dimitriadis, and Emily Mower Provost
(University of Michigan, USA; IBM Research, USA)
In this paper, we present an analysis of different multimodal fusion approaches in the context of deep learning, focusing on pooling intermediate representations learned for the acoustic and lexical modalities. Traditional approaches to multimodal feature pooling include: concatenation, element-wise addition, and element-wise multiplication. We compare these traditional methods to outer-product and compact bilinear pooling approaches, which consider more comprehensive interactions between features from the two modalities. We also study the influence of each modality on the overall performance of a multimodal system. Our experiments on the IEMOCAP dataset suggest that: (1) multimodal methods that combine acoustic and lexical features outperform their unimodal counterparts; (2) the lexical modality is better for predicting valence than the acoustic modality; (3) outer-product-based pooling strategies outperform other pooling strategies.
@InProceedings{ICMI17p68,
author = {Zakaria Aldeneh and Soheil Khorram and Dimitrios Dimitriadis and Emily Mower Provost},
title = {Pooling Acoustic and Lexical Features for the Prediction of Valence},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {68--72},
doi = {},
year = {2017},
}
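The pooling strategies compared in the paper above (concatenation, element-wise addition, element-wise multiplication, and outer product) can be sketched in a few lines of Python. This is a minimal NumPy illustration with toy dimensions, not the authors' implementation, and it omits compact bilinear pooling.
import numpy as np

def pool_features(acoustic, lexical, strategy="concat"):
    """Combine two modality embeddings with a chosen pooling strategy."""
    if strategy == "concat":   # simple concatenation
        return np.concatenate([acoustic, lexical])
    if strategy == "add":      # element-wise addition (requires equal dimensions)
        return acoustic + lexical
    if strategy == "mul":      # element-wise multiplication (requires equal dimensions)
        return acoustic * lexical
    if strategy == "outer":    # outer product captures all pairwise feature interactions
        return np.outer(acoustic, lexical).ravel()
    raise ValueError(f"unknown strategy: {strategy}")

# toy example: 4-d acoustic and 4-d lexical intermediate representations
a = np.random.randn(4)
l = np.random.randn(4)
for s in ("concat", "add", "mul", "outer"):
    print(s, pool_features(a, l, s).shape)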
Oral Session 3: Touch and Gesture
Hand-to-Hand: An Intermanual Illusion of Movement
Dario Pittera, Marianna Obrist, and Ali Israr
(Disney Research, USA; University of Sussex, UK)
Apparent tactile motion has been shown to occur across many contiguous parts of the body, such as fingers, forearms, and back. A recent study demonstrated the possibility of eliciting the illusion of movement from one hand to the other when interconnected by a tablet. In this paper we explore intermanual apparent tactile motion without any object between the hands. In a series of psychophysical experiments we determine the control space for generating smooth and consistent motion, using two vibrating handles which we refer to as the Hand-to-Hand vibrotactile device. In a first experiment we investigated the occurrence of the phenomenon (i.e., the movement illusion) and the generation of a perceptive model. In a second experiment, based on those results, we investigated the effect of hand postures on the illusion. Finally, in a third experiment we explored two visuo-tactile matching tasks in a multimodal VR setting. Our results can be applied in VR applications with intermanual tactile interactions.
@InProceedings{ICMI17p73,
author = {Dario Pittera and Marianna Obrist and Ali Israr},
title = {Hand-to-Hand: An Intermanual Illusion of Movement},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {73--81},
doi = {},
year = {2017},
}
An Investigation of Dynamic Crossmodal Instantiation in TUIs
Feng Feng and Tony Stockman
(Queen Mary University of London, UK)
There is growing research interest in combining crossmodal research with tangible interaction design. However, most behavioural research on crossmodal perception and on multimodal tangible interaction evaluation is based on static and non-interactive signals. Grounded in cognitive neuroscience studies and behavioural research on multisensory perception, we present two interactive, dynamic crossmodal studies based on unimodal and crossmodal physical dimensions through an interactive table-top. We tested the implementation in two user studies. Results indicate that, first, dynamic crossmodal congruency was spontaneously invoked during the tangible interaction and elicited better user performance than uni-modal instantiation in the conditions involving vision and haptics; second, crossmodal instantiations using all three modalities may not produce the best user performance in the interaction; and third, participants’ subconscious reactions towards perceived stimuli during interaction may not always be consistent with their explicit verbalization.
@InProceedings{ICMI17p82,
author = {Feng Feng and Tony Stockman},
title = {An Investigation of Dynamic Crossmodal Instantiation in TUIs},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {82--90},
doi = {},
year = {2017},
}
“Stop over There”: Natural Gesture and Speech Interaction for Non-critical Spontaneous Intervention in Autonomous Driving
Robert Tscharn, Marc Erich Latoschik, Diana Löffler, and Jörn Hurtienne
(University of Würzburg, Germany)
We propose a new multimodal input technique for Non-critical Spontaneous Situations (NCSSs) in autonomous driving scenarios such as selecting a parking lot or picking up a hitchhiker. Speech and deictic (pointing) gestures were combined to instruct the car about desired interventions which include spatial references to the current environment (e.g., "stop over [pointing] there" or "take [pointing] this parking lot"). In this way, advantages from both modalities were exploited: speech allows for selecting from many manoeuvres and functions in the car (e.g., stop, park), whereas deictic gestures provide a natural and intuitive way of indicating the spatial discourse referents used in these interventions (e.g., near this tree, that parking lot). The speech and pointing gesture input was compared to speech and touch-based input in a user study with 38 participants. The touch-based input was selected as a baseline due to its widespread use in in-car touch screens. The evaluation showed that speech and pointing gestures are perceived as more natural, intuitive, and less cognitively demanding than speech and touch, and are thus recommended as an intervention technique for NCSSs in autonomous driving.
@InProceedings{ICMI17p91,
author = {Robert Tscharn and Marc Erich Latoschik and Diana Löffler and Jörn Hurtienne},
title = {“Stop over There”: Natural Gesture and Speech Interaction for Non-critical Spontaneous Intervention in Autonomous Driving},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {91--100},
doi = {},
year = {2017},
}
Pre-touch Proxemics: Moving the Design Space of Touch Targets from Still Graphics towards Proxemic Behaviors
Ilhan Aslan and Elisabeth André
(University of Augsburg, Germany)
Proxemic touch targets continuously change in relation to a user's hand in mid-air before a physical touch occurs. Previous work has, for example, shown that expanding targets can improve target acquisition performance on touch interfaces. However, it is unclear how proxemic touch targets influence user experience (UX) in a broader sense, including hedonic qualities. Towards closing this research gap, the paper reports on two user studies. The first study is a qualitative study with five experts, providing in-depth insights on a variety of change-types (e.g., size, form, color) and how they, for example, influence the perceived functional and aesthetic qualities of proxemic touch targets. A follow-up user study with 36 participants explores the UX of a proxemic touch target compared to non-proximal versions of the same target. The results highlight a significant positive effect of the proxemic design on both pragmatic and hedonic qualities.
@InProceedings{ICMI17p101,
author = {Ilhan Aslan and Elisabeth André},
title = {Pre-touch Proxemics: Moving the Design Space of Touch Targets from Still Graphics towards Proxemic Behaviors},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {101--109},
doi = {},
year = {2017},
}
Freehand Grasping in Mixed Reality: Analysing Variation during Transition Phase of Interaction
Maadh Al-Kalbani, Maite Frutos-Pascual, and Ian Williams
(Birmingham City University, UK)
This paper presents an assessment of the variability in freehand grasping of virtual objects in an exocentric mixed reality environment. We report on an experiment covering 480 grasp-based motions in the transition phase of interaction to determine the level of variation in the grasp aperture. Controlled laboratory conditions were used in which 30 right-handed participants were instructed to grasp and move an object (cube or sphere) using a single medium wrap grasp from a starting location (A) to a target location (B) in a controlled manner. We present a comprehensive statistical analysis of the results showing the variation in grasp change during this phase of interaction. In conclusion, we detail recommendations for freehand virtual object interaction design, notably that consideration should be given to the change in grasp aperture over the transition phase of interaction.
@InProceedings{ICMI17p110,
author = {Maadh Al-Kalbani and Maite Frutos-Pascual and Ian Williams},
title = {Freehand Grasping in Mixed Reality: Analysing Variation during Transition Phase of Interaction},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {110--114},
doi = {},
year = {2017},
}
Rhythmic Micro-Gestures: Discreet Interaction On-the-Go
Euan Freeman, Gareth Griffiths, and Stephen A. Brewster
(University of Glasgow, UK)
We present rhythmic micro-gestures, micro-movements of the hand that are repeated in time with a rhythm. We present a user study that investigated how well users can perform rhythmic micro-gestures and if they can use them eyes-free with non-visual feedback. We found that users could successfully use our interaction technique (97% success rate across all gestures) with short interaction times, rating them as low difficulty as well. Simple audio cues that only convey the rhythm outperformed animations showing the hand movements, supporting rhythmic micro-gestures as an eyes-free input technique.
@InProceedings{ICMI17p115,
author = {Euan Freeman and Gareth Griffiths and Stephen A. Brewster},
title = {Rhythmic Micro-Gestures: Discreet Interaction On-the-Go},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {115--119},
doi = {},
year = {2017},
}
Oral Session 4: Sound and Interaction
Evaluation of Psychoacoustic Sound Parameters for Sonification
Jamie Ferguson and Stephen A. Brewster
(University of Glasgow, UK)
Sonification designers have little theory or experimental evidence to guide the design of data-to-sound mappings. Many mappings use acoustic representations of data values which do not correspond with the listener's perception of how that data value should sound during sonification. This research evaluates data-to-sound mappings that are based on psychoacoustic sensations, in an attempt to move towards using data-to-sound mappings that are aligned with the listener's perception of the data value's auditory connotations. Multiple psychoacoustic parameters were evaluated over two experiments, which were designed in the context of a domain-specific problem - detecting the level of focus of an astronomical image through auditory display. Recommendations for designing sonification systems with psychoacoustic sound parameters are presented based on our results.
@InProceedings{ICMI17p120,
author = {Jamie Ferguson and Stephen A. Brewster},
title = {Evaluation of Psychoacoustic Sound Parameters for Sonification},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {120--127},
doi = {},
year = {2017},
}
Utilising Natural Cross-Modal Mappings for Visual Control of Feature-Based Sound Synthesis
Augoustinos Tsiros and Grégory Leplâtre
(Edinburgh Napier University, UK)
This paper presents the results of an investigation into audio-visual (AV) correspondences conducted as part of the development of Morpheme, a painting interface to control a corpus-based concatenative sound synthesis algorithm. Previous research has identified strong AV correspondences between dimensions such as pitch and vertical position or loudness and size. However, these correspondences are usually established empirically by only varying a single audio or visual parameter. Although it is recognised that the perception of AV correspondences is affected by the interaction between the parameters of auditory or visual stimuli when these are complex multidimensional objects, there has been little research into perceived AV correspondences when complex dynamic sounds are involved. We conducted an experiment in which two AV mapping strategies and three audio corpora were empirically evaluated. 110 participants were asked to rate the perceived similarity of six AV associations. The results confirmed that size/loudness, vertical position/pitch, colour brightness/spectral brightness are strongly associated. A weaker but significant association was found between texture granularity and sound dissonance, as well as colour complexity and sound dissonance. Harmonicity was found to have a moderating effect on the perceived strengths of these associations: the higher the harmonicity of the sounds, the stronger the perceived AV associations.
@InProceedings{ICMI17p128,
author = {Augoustinos Tsiros and Grégory Leplâtre},
title = {Utilising Natural Cross-Modal Mappings for Visual Control of Feature-Based Sound Synthesis},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {128--136},
doi = {},
year = {2017},
}
Oral Session 5: Methodology
Automatic Classification of Auto-correction Errors in Predictive Text Entry Based on EEG and Context Information
Felix Putze, Maik Schünemann, Tanja Schultz, and Wolfgang Stuerzlinger
(University of Bremen, Germany; Simon Fraser University, Canada)
State-of-the-art auto-correction methods for predictive text entry systems work reasonably well, but can never be perfect due to the properties of human language. We present an approach for the automatic detection of erroneous auto-corrections based on brain activity and text-entry-based context features. We describe an experiment and a new system for the classification of human reactions to auto-correction errors. We show how auto-correction errors can be detected with an average accuracy of 85%.
@InProceedings{ICMI17p137,
author = {Felix Putze and Maik Schünemann and Tanja Schultz and Wolfgang Stuerzlinger},
title = {Automatic Classification of Auto-correction Errors in Predictive Text Entry Based on EEG and Context Information},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {137--145},
doi = {},
year = {2017},
}
Cumulative Attributes for Pain Intensity Estimation
Joy O. Egede and Michel Valstar
(University of Nottingham at Ningbo, China; University of Nottingham, UK)
Pain estimation from face video is a hard problem in automatic behaviour understanding. One major obstacle is the difficulty of collecting sufficient amounts of data, with balanced amounts of data for all pain intensity levels. To overcome this, we propose to adopt Cumulative Attributes, which assume that attributes for high pain levels with few examples are a superset of all attributes of lower pain levels. Experimental results show a consistent relative performance increase on the order of 20% regardless of the features used. Our final system significantly outperforms the state of the art on the UNBC-McMaster Shoulder Pain database by using cumulative attributes with Relevance Vector Regression on a combination of features, including appearance, geometric, and deep learned features.
@InProceedings{ICMI17p146,
author = {Joy O. Egede and Michel Valstar},
title = {Cumulative Attributes for Pain Intensity Estimation},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {146--153},
doi = {},
year = {2017},
}
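The cumulative-attribute idea above has a simple encoding: an ordinal intensity level is mapped to a binary vector whose first entries are set, so higher levels are supersets of lower ones. The Python sketch below illustrates that encoding and one possible decoding rule; the level range and the accumulation-based decoding are assumptions for the example, not the paper's exact setup.
import numpy as np

def cumulative_encode(level, max_level):
    """Encode an ordinal pain level as a cumulative attribute vector:
    attributes up to the observed level are 1, the rest 0, so higher
    levels are supersets of lower ones."""
    target = np.zeros(max_level, dtype=float)
    target[:level] = 1.0
    return target

def cumulative_decode(predicted_attributes):
    """Recover an intensity estimate by accumulating the predicted attributes."""
    return float(np.clip(predicted_attributes, 0.0, 1.0).sum())

# toy example on an assumed 15-level intensity scale
print(cumulative_encode(3, 15))                                        # first three attributes are 1
print(cumulative_decode(np.array([0.9, 0.8, 0.7, 0.2, 0.1] + [0.0] * 10)))  # -> 2.7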
Towards the Use of Social Interaction Conventions as Prior for Gaze Model Adaptation
Rémy Siegfried, Yu Yu, and Jean-Marc Odobez
(Idiap, Switzerland; EPFL, Switzerland)
Gaze is an important non-verbal cue involved in many facets of social interactions like communication, attentiveness or attitudes. Nevertheless, extracting gaze directions visually and remotely usually suffers from large errors because of low resolution images, inaccurate eye cropping, or large eye shape variations across the population, amongst others. This paper hypothesizes that these challenges can be addressed by exploiting multimodal social cues for gaze model adaptation on top of a head-pose-independent 3D gaze estimation framework. First, a robust eye cropping refinement is achieved by combining a semantic face model with eye landmark detections, and we investigate whether temporal smoothing can overcome the limitations of instantaneous refinement. Secondly, to study whether social interaction conventions could be used as priors for adaptation, we exploited the speaking status and head pose constraints to derive soft gaze labels and infer person-specific gaze bias using robust statistics. Experimental results on gaze coding in natural interactions from two different settings demonstrate that the two steps of our gaze adaptation method contribute to reduce gaze errors by a large margin over the baseline and can be generalized to several identities in challenging scenarios.
@InProceedings{ICMI17p154,
author = {Rémy Siegfried and Yu Yu and Jean-Marc Odobez},
title = {Towards the Use of Social Interaction Conventions as Prior for Gaze Model Adaptation},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {154--162},
doi = {},
year = {2017},
}
Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning
Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency
(Carnegie Mellon University, USA)
With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. Contrary to previous works in multimodal sentiment analysis which focus on holistic information in speech segments such as bag of words representations and average facial expression intensity, we propose a novel deep architecture for multimodal sentiment analysis that is able to perform modality fusion at the word level. In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model that is composed of 2 modules. The Gated Multimodal Embedding allows us to alleviate the difficulties of fusion when there are noisy modalities. The LSTM with Temporal Attention can perform word level fusion at a finer fusion resolution between the input modalities and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and perform better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly-available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis on our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering these noisy modalities out. These results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.
@InProceedings{ICMI17p163,
author = {Minghai Chen and Sen Wang and Paul Pu Liang and Tadas Baltrušaitis and Amir Zadeh and Louis-Philippe Morency},
title = {Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {163--171},
doi = {},
year = {2017},
}
IntelliPrompter: Speech-Based Dynamic Note Display Interface for Oral Presentations
Reza Asadi, Ha Trinh, Harriet J. Fell, and Timothy W. Bickmore
(Northeastern University, USA)
The fear of forgetting what to say next is a common problem in oral presentation delivery. To maintain good content coverage and reduce anxiety, many presenters use written notes in conjunction with presentation slides. However, excessive use of notes during delivery can lead to disengagement with the audience and low quality presentations. We designed IntelliPrompter, a speech-based note display system that automatically tracks a presenter’s coverage of each slide’s content and dynamically adjusts the note display interface to highlight the most likely next topic to present. We developed two versions of our intelligent teleprompters using Google Glass and computer screen displays. The design of our system was informed by findings from 36 interviews with presenters and analysis of a presentation note corpus. In a within-subjects study comparing our dynamic screen-based and Google Glass note display interfaces with a static note system, presenters and independent judges expressed a strong preference for the dynamic screen-based system.
@InProceedings{ICMI17p172,
author = {Reza Asadi and Ha Trinh and Harriet J. Fell and Timothy W. Bickmore},
title = {IntelliPrompter: Speech-Based Dynamic Note Display Interface for Oral Presentations},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {172--180},
doi = {},
year = {2017},
}
Oral Session 6: Artificial Agents and Wearable Sensors
Head and Shoulders: Automatic Error Detection in Human-Robot Interaction
Pauline Trung, Manuel Giuliani, Michael Miksch, Gerald Stollnberger, Susanne Stadler, Nicole Mirnig, and Manfred Tscheligi
(University of Salzburg, Austria; University of the West of England, UK; Austrian Institute of Technology, Austria)
We describe a novel method for automatic detection of errors in human-robot interactions. Our approach is to detect errors based on the classification of head and shoulder movements of humans who are interacting with erroneous robots. We conducted a user study in which participants interacted with a robot that we programmed to make two types of errors: social norm violations and technical failures. During the interaction, we recorded the behavior of the participants with a Kinect v1 RGB-D camera. Overall, we recorded a data corpus of 237,998 frames at 25 frames per second; 83.48% of the frames showed no error situation and 16.52% showed an error situation. Furthermore, we computed six different feature sets to represent the movements of the participants and temporal aspects of their movements. Using this data we trained a rule learner, a Naive Bayes classifier, and a k-nearest neighbor classifier and evaluated the classifiers with 10-fold cross validation and leave-one-out cross validation. The results of this evaluation suggest the following: (1) the detection of an error situation works well when the robot has seen the human before; (2) rule learner and k-nearest neighbor classifiers work well for automated error detection when the robot is interacting with a known human; (3) for unknown humans, the Naive Bayes classifier performed best; (4) the classification of social norm violations performs worst; (5) there was no large performance difference between using the original data and normalized feature sets that represent the relative position of the participants.
@InProceedings{ICMI17p181,
author = {Pauline Trung and Manuel Giuliani and Michael Miksch and Gerald Stollnberger and Susanne Stadler and Nicole Mirnig and Manfred Tscheligi},
title = {Head and Shoulders: Automatic Error Detection in Human-Robot Interaction},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {181--188},
doi = {},
year = {2017},
}
The Reliability of Non-verbal Cues for Situated Reference Resolution and Their Interplay with Language: Implications for Human Robot Interaction
Stephanie Gross, Brigitte Krenn, and Matthias Scheutz
(Austrian Research Institute for Artificial Intelligence, Austria; Tufts University, USA)
When uttering referring expressions in situated task descriptions, humans naturally use verbal and non-verbal channels to transmit information to their interlocutor. To develop mechanisms for robot architectures capable of resolving object references in such interaction contexts, we need to better understand the multi-modality of human situated task descriptions. Current computational models mainly include pointing gestures, eye gaze, and objects in the visual field as non-verbal cues, if any. We analyse reference resolution to objects in an object manipulation task and find that only up to 50% of all referring expressions to objects can be resolved using language, eye gaze and pointing gestures. Thus, we extract other non-verbal cues necessary for reference resolution to objects, investigate the reliability of the different verbal and non-verbal cues, and formulate lessons for the design of a robot’s natural language understanding capabilities.
@InProceedings{ICMI17p189,
author = {Stephanie Gross and Brigitte Krenn and Matthias Scheutz},
title = {The Reliability of Non-verbal Cues for Situated Reference Resolution and Their Interplay with Language: Implications for Human Robot Interaction},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {189--196},
doi = {},
year = {2017},
}
Do You Speak to a Human or a Virtual Agent? Automatic Analysis of User’s Social Cues during Mediated Communication
Magalie Ochs, Nathan Libermann, Axel Boidin, and Thierry Chaminade
(Aix-Marseille University, France; University of Toulon, France; Picxel, France)
While several research works have shown that virtual agents are able to generate natural and social behaviors from users, few of them have compared these social reactions to those expressed during a human-human mediated communication. In this paper, we propose to explore the social cues expressed by a user during a mediated communication either with an embodied conversational agent or with another human. For this purpose, we have exploited a machine learning method to identify the facial and head social cue characteristics in each interaction type and to construct a model to automatically determine if the user is interacting with a virtual agent or another human. The results show that, in fact, the users do not express the same facial and head movements during a communication with a virtual agent or another user. Based on these results, we propose to use such a machine learning model to automatically measure the social capability of a virtual agent to generate a social behavior in the user comparable to a human-human interaction. The resulting model can detect automatically if the user is communicating with a virtual or real interlocutor, looking only at the user’s face and head during one second.
@InProceedings{ICMI17p197,
author = {Magalie Ochs and Nathan Libermann and Axel Boidin and Thierry Chaminade},
title = {Do You Speak to a Human or a Virtual Agent? Automatic Analysis of User’s Social Cues during Mediated Communication},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {197--205},
doi = {},
year = {2017},
}
Estimating Verbal Expressions of Task and Social Cohesion in Meetings by Quantifying Paralinguistic Mimicry
Marjolein C. Nanninga, Yanxia Zhang, Nale Lehmann-Willenbrock, Zoltán Szlávik, and Hayley Hung
(Delft University of Technology, Netherlands; University of Amsterdam, Netherlands; VU University Amsterdam, Netherlands)
In this paper we propose a novel method of estimating verbal expressions of task and social cohesion by quantifying the dynamic alignment of nonverbal behaviors in speech. As team cohesion has been linked to team effectiveness and productivity, automatically estimating team cohesion can be a useful tool for assessing meeting quality and broader team functioning. In total, more than 20 hours of business meetings (3-8 people) were recorded and annotated for behavioral indicators of group cohesion, distinguishing between social and task cohesion. We hypothesized that behaviors commonly referred to as mimicry can be indicative of verbal expressions of social and task cohesion. Where most prior work targets mimicry of dyads, we investigated the effectiveness of quantifying group-level phenomena. A dynamic approach was adopted in which both the cohesion expressions and the paralinguistic mimicry were quantified on small time windows. By extracting features solely related to the alignment of paralinguistic speech behavior, we found that 2-minute high and low social cohesive regions could be classified with a 0.71 Area under the ROC curve, performing on par with the state-of-the-art where turn-taking features were used. Estimating task cohesion was more challenging, obtaining an accuracy of 0.64 AUC, outperforming the state-of-the-art. Our results suggest that our proposed methodology is successful in quantifying group-level paralinguistic mimicry. As both the state-of-the-art turn-taking features and mimicry features performed worse on estimating task cohesion, we conclude that social cohesion is more openly expressed by nonverbal vocal behavior than task cohesion.
@InProceedings{ICMI17p206,
author = {Marjolein C. Nanninga and Yanxia Zhang and Nale Lehmann-Willenbrock and Zoltán Szlávik and Hayley Hung},
title = {Estimating Verbal Expressions of Task and Social Cohesion in Meetings by Quantifying Paralinguistic Mimicry},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {206--215},
doi = {},
year = {2017},
}
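A very crude way to quantify paralinguistic alignment on short windows, in the spirit of the mimicry features above, is a lagged windowed correlation between two speakers' feature tracks. The Python sketch below is illustrative only; the paper's actual mimicry features, window sizes, and group-level aggregation differ, and all names and parameters here are assumptions.
import numpy as np

def windowed_mimicry(feat_a, feat_b, win=120, lag=10):
    """Crude mimicry score: correlation between one speaker's paralinguistic
    feature track and a lagged version of the other's, computed per window."""
    scores = []
    for start in range(0, len(feat_a) - win - lag, win):
        a = feat_a[start:start + win]
        b = feat_b[start + lag:start + lag + win]
        if np.std(a) > 0 and np.std(b) > 0:
            scores.append(np.corrcoef(a, b)[0, 1])
    return scores

# toy example: per-second pitch tracks for two meeting participants
pitch_a = np.random.randn(1200)
pitch_b = 0.5 * np.roll(pitch_a, 10) + 0.5 * np.random.randn(1200)
print(np.mean(windowed_mimicry(pitch_a, pitch_b)))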
Data Augmentation of Wearable Sensor Data for Parkinson’s Disease Monitoring using Convolutional Neural Networks
Terry T. Um, Franz M. J. Pfister, Daniel Pichler, Satoshi Endo, Muriel Lang, Sandra Hirche, Urban Fietzek, and Dana Kulić
(University of Waterloo, Canada; LMU Munich, Germany; TU Munich, Germany; Schön Klinik München Schwabing, Germany)
While convolutional neural networks (CNNs) have been successfully applied to many challenging classification applications, they typically require large datasets for training. When the availability of labeled data is limited, data augmentation is a critical preprocessing step for CNNs. However, data augmentation for wearable sensor data has not been deeply investigated yet.
In this paper, various data augmentation methods for wearable sensor data are proposed. The proposed methods and CNNs are applied to the classification of the motor state of Parkinson’s Disease patients, which is challenging due to small dataset size, noisy labels, and large intra-class variability. Appropriate augmentation improves the classification performance from 77.54% to 86.88%.
@InProceedings{ICMI17p216,
author = {Terry T. Um and Franz M. J. Pfister and Daniel Pichler and Satoshi Endo and Muriel Lang and Sandra Hirche and Urban Fietzek and Dana Kulić},
title = {Data Augmentation of Wearable Sensor Data for Parkinson’s Disease Monitoring using Convolutional Neural Networks},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {216--220},
doi = {},
year = {2017},
}
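Two augmentation operations commonly applied to wearable sensor time series, jittering (additive noise) and scaling (random per-channel gain), are sketched below. This is an illustrative Python example, not the authors' code; the noise levels, copy counts, and array shapes are assumptions, and the paper proposes further augmentation methods beyond these two.
import numpy as np

def jitter(x, sigma=0.05):
    """Add Gaussian noise to each sample of a (time, channels) sensor signal."""
    return x + np.random.normal(0.0, sigma, size=x.shape)

def scale(x, sigma=0.1):
    """Multiply each channel by a random factor, simulating sensor gain changes."""
    factors = np.random.normal(1.0, sigma, size=(1, x.shape[1]))
    return x * factors

def augment(batch, n_copies=4):
    """Create augmented copies of each recording before CNN training."""
    out = []
    for x in batch:
        out.append(x)
        for _ in range(n_copies):
            out.append(scale(jitter(x)))
    return out

# toy example: one 6-channel IMU recording of 500 samples
recordings = [np.random.randn(500, 6)]
print(len(augment(recordings)))   # 5 (original + 4 augmented copies)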
Poster Session 1
Automatic Assessment of Communication Skill in Non-conventional Interview Settings: A Comparative Study
Pooja Rao S. B, Sowmya Rasipuram, Rahul Das, and Dinesh Babu Jayagopi
(IIIT Bangalore, India)
Effective communication is an important social skill that enables us to interpret and connect with people around us and is of utmost importance in employment-based interviews. This paper presents a methodical study and automatic measurement of the communication skill of candidates in different modes of behavioural interviews. It demonstrates a comparative analysis of non-conventional methods of employment interviews, namely 1) interface-based asynchronous video interviews and 2) written interviews (including a short essay). In order to achieve this, we have collected a dataset of 100 structured interviews from participants. These interviews are evaluated independently by two human expert annotators on rubrics specific to each of the settings. We then propose a predictive model using automatically extracted multimodal features (audio, visual, and lexical), applying classical machine learning algorithms. Our best model performs with an accuracy of 75% for a binary classification task in all three contexts. We also study the differences between expert perception and automatic prediction across the settings.
@InProceedings{ICMI17p221,
author = {Pooja Rao S. B and Sowmya Rasipuram and Rahul Das and Dinesh Babu Jayagopi},
title = {Automatic Assessment of Communication Skill in Non-conventional Interview Settings: A Comparative Study},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {221--229},
doi = {},
year = {2017},
}
Low-Intrusive Recognition of Expressive Movement Qualities
Radoslaw Niewiadomski, Maurizio Mancini, Stefano Piana, Paolo Alborno, Gualtiero Volpe, and Antonio Camurri
(University of Genoa, Italy)
In this paper we present a low-intrusive approach to the detection of expressive full-body movement qualities. We focus on two qualities: Lightness and Fragility and we detect them using the data captured by four wearable devices, two Inertial Movement Units (IMU) and two electromyographs (EMG), placed on the forearms.
The work we present in this paper stems from a close collaboration with expressive movement experts (e.g., contemporary dance choreographers) aimed at defining a vocabulary of basic movement qualities. We recorded 13 dancers performing movements expressing the qualities under investigation. The recordings were then segmented, and the perceived level of each quality for each segment was ranked by 5 experts using a 5-point Likert scale. We obtained a dataset of 150 segments of movement expressing Fragility and/or Lightness.
In the second part of the paper, we define a set of features on the IMU and EMG data and extract them from the recorded corpus. We finally applied a set of supervised machine learning techniques to classify the segments. The best results for the whole dataset were obtained with a Naive Bayes classifier for Lightness (F-score 0.77) and with a Support Vector Machine classifier for Fragility (F-score 0.77). Our approach can be used in ecological contexts, e.g., during artistic performances.
@InProceedings{ICMI17p230,
author = {Radoslaw Niewiadomski and Maurizio Mancini and Stefano Piana and Paolo Alborno and Gualtiero Volpe and Antonio Camurri},
title = {Low-Intrusive Recognition of Expressive Movement Qualities},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {230--237},
doi = {},
year = {2017},
}
Digitising a Medical Clerking System with Multimodal Interaction Support
Harrison South, Martin Taylor, Huseyin Dogan, and Nan Jiang
(Bournemouth University, UK; Royal Bournemouth and Christchurch Hospital, UK)
The Royal Bournemouth and Christchurch Hospitals (RBCH) use a series of paper forms to record their interactions with patients; while these have been highly successful, the world is going digital and the National Health Service (NHS) has planned to be completely paperless by 2020. Using the Scrum project management methodology, supported by the System Usability Scale (SUS) usability evaluation technique and the NASA TLX workload measurement technique, a prototype web application system was built and evaluated for the client. The prototype used a varied set of input media including voice, text and stylus to ensure that users were more likely to adopt the system. This web-based system was successfully developed and evaluated at RBCH. The evaluation showed that the application was usable and accessible but raised many different questions about the nature of software in hospitals. While the project looked at how different input media can be used in a hospital, it found that just because it is possible to input data in some familiar format (e.g. voice), it is not always in the best interest of the end-users and the patients.
@InProceedings{ICMI17p238,
author = {Harrison South and Martin Taylor and Huseyin Dogan and Nan Jiang},
title = {Digitising a Medical Clerking System with Multimodal Interaction Support},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {238--242},
doi = {},
year = {2017},
}
GazeTap: Towards Hands-Free Interaction in the Operating Room
Benjamin Hatscher, Maria Luz, Lennart E. Nacke, Norbert Elkmann, Veit Müller, and Christian Hansen
(University of Magdeburg, Germany; University of Waterloo, Canada; Fraunhofer IFF, Germany)
During minimally-invasive interventions, physicians need to interact with medical image data, which cannot be done while the hands are occupied. To address this challenge, we propose two interaction techniques which use gaze and foot as input modalities for hands-free interaction. To investigate the feasibility of these techniques, we created a setup consisting of a mobile eye-tracking device, a tactile floor, two laptops, and the large screen of an angiography suite. We conducted a user study to evaluate how to navigate medical images without the need for hand interaction. Both multimodal approaches, as well as a foot-only interaction technique, were compared regarding task completion time and subjective workload. The results revealed comparable performance of all methods. Selection is accomplished faster via gaze than with a foot-only approach, but gaze and foot easily interfere when used at the same time. This paper contributes to HCI by providing techniques and evaluation results for combined gaze and foot interaction when standing. Our method may enable more effective computer interactions in the operating room, resulting in a more beneficial use of medical information.
@InProceedings{ICMI17p243,
author = {Benjamin Hatscher and Maria Luz and Lennart E. Nacke and Norbert Elkmann and Veit Müller and Christian Hansen},
title = {GazeTap: Towards Hands-Free Interaction in the Operating Room},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {243--251},
doi = {},
year = {2017},
}
Boxer: A Multimodal Collision Technique for Virtual Objects
Byungjoo Lee, Qiao Deng, Eve Hoggan, and Antti Oulasvirta
(Aalto University, Finland; KAIST, South Korea; Aarhus University, Denmark)
Virtual collision techniques are interaction techniques for invoking discrete events in a virtual scene, e.g. throwing, pushing, or pulling an object with a pointer. The conventional approach involves detecting collisions as soon as the pointer makes contact with the object. Furthermore, in general, motor patterns can only be adjusted based on visual feedback. The paper presents a multimodal technique based on the principle that collisions should be aligned with the most salient sensory feedback. Boxer (1) triggers a collision at the moment where the pointer's speed reaches a minimum after first contact and (2) is synchronized with vibrotactile stimuli presented to the hand controlling the pointer. Boxer was compared with the conventional technique in two user studies (with temporal pointing and virtual batting). Boxer improved spatial precision in collisions by 26.7 % while accuracy was compromised under some task conditions. No difference was found in temporal precision. Possibilities for improving virtual collision techniques are discussed.
@InProceedings{ICMI17p252,
author = {Byungjoo Lee and Qiao Deng and Eve Hoggan and Antti Oulasvirta},
title = {Boxer: A Multimodal Collision Technique for Virtual Objects},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {252--260},
doi = {},
year = {2017},
}
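Boxer's first ingredient, triggering the collision at the pointer's speed minimum after first contact, can be expressed as a small search over a speed trace. The Python snippet below is an illustrative sketch under that reading of the abstract; it ignores the vibrotactile synchronization and uses made-up numbers.
def boxer_collision_frame(speeds, contact_frame):
    """Return the frame at which a Boxer-style collision would fire: the first
    local minimum of pointer speed after initial contact with the object."""
    for i in range(contact_frame + 1, len(speeds) - 1):
        if speeds[i] <= speeds[i - 1] and speeds[i] <= speeds[i + 1]:
            return i
    return len(speeds) - 1  # fall back to the last frame if no minimum is found

# toy pointer-speed trace (arbitrary units), contact detected at frame 3
trace = [5.0, 4.2, 3.1, 2.5, 1.8, 1.2, 0.9, 1.4, 2.0]
print(boxer_collision_frame(trace, contact_frame=3))   # 6 -> local speed minimum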
Trust Triggers for Multimodal Command and Control Interfaces
Helen Hastie, Xingkun Liu, and Pedro Patron
(Heriot-Watt University, UK; SeeByte, UK)
For autonomous systems to be accepted by society and operators, they have to instil the appropriate level of trust. In this paper, we discuss what dimensions constitute trust and examine certain triggers of trust for an autonomous underwater vehicle, comparing a multimodal command and control interface with a language-only reporting system. We conclude that there is a relationship between perceived trust and the clarity of a user's Mental Model and that this Mental Model is clearer in a multimodal condition, compared to language-only. Regarding trust triggers, we are able to show that a number of triggers, such as anomalous sensor readings, noticeably modify the perceived trust of the subjects, but in an appropriate manner, thus illustrating the utility of the interface.
@InProceedings{ICMI17p261,
author = {Helen Hastie and Xingkun Liu and Pedro Patron},
title = {Trust Triggers for Multimodal Command and Control Interfaces},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {261--268},
doi = {},
year = {2017},
}
TouchScope: A Hybrid Multitouch Oscilloscope Interface
Matthew Heinz, Sven Bertel, and Florian Echtler
(Bauhaus-Universität Weimar, Germany)
We present TouchScope, a hybrid multitouch interface for common off-the-shelf oscilloscopes. Oscilloscopes are a valuable tool for analyzing and debugging electronic circuits, but are also complex scientific instruments. Novices are faced with a seemingly overwhelming array of knobs and buttons, and usually require lengthy training before being able to use these devices productively.
In this paper, we present our implementation of TouchScope which uses a multitouch tablet in combination with an unmodified off-the-shelf oscilloscope to provide a novice-friendly hybrid interface, combining both the low entry barrier of a touch-based interface and the high degrees of freedom of a conventional button-based interface. Our evaluation with 29 inexperienced participants shows a comparable performance to traditional learning materials as well as a significantly higher level of perceived usability.
@InProceedings{ICMI17p269,
author = {Matthew Heinz and Sven Bertel and Florian Echtler},
title = {TouchScope: A Hybrid Multitouch Oscilloscope Interface},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {269--273},
doi = {},
year = {2017},
}
A Multimodal System to Characterise Melancholia: Cascaded Bag of Words Approach
Shalini Bhatia, Munawar Hayat, and Roland Goecke
(University of Canberra, Australia)
Recent years have seen a lot of activity in affective computing for automated analysis of depression. However, no research has so far proposed a multimodal system for classifying different subtypes of depression such as melancholia. The mental state assessment of a mood disorder depends primarily on appearance, behaviour, speech, thought, perception, mood and facial affect. Mood and facial affect mainly contribute to distinguishing melancholia from non-melancholia. These are assessed by clinicians, and hence vulnerable to subjective judgement. As a result, clinical assessment alone may not accurately capture the presence or absence of specific disorders such as melancholia, a distressing condition whose presence has important treatment implications. Melancholia is characterised by severe anhedonia and psychomotor disturbance, which can be a combination of motor retardation with periods of superimposed agitation. Psychomotor disturbance can be sensed in both face and voice. To the best of our knowledge, this study is the first attempt to propose a multimodal system to differentiate melancholia from non-melancholia and healthy controls. We report the sensitivity and specificity of classification in depressive subtypes.
@InProceedings{ICMI17p274,
author = {Shalini Bhatia and Munawar Hayat and Roland Goecke},
title = {A Multimodal System to Characterise Melancholia: Cascaded Bag of Words Approach},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {274--280},
doi = {},
year = {2017},
}
Crowdsourcing Ratings of Caller Engagement in Thin-Slice Videos of Human-Machine Dialog: Benefits and Pitfalls
Vikram Ramanarayanan, Chee Wee Leong, David Suendermann-Oeft, and Keelan Evanini
(ETS at San Francisco, USA; ETS at Princeton, USA)
We analyze the efficacy of different crowds of naive human raters in rating engagement during human-machine dialog interactions. Each rater viewed multiple 10-second, thin-slice videos of native and non-native English speakers interacting with a computer-assisted language learning (CALL) system and rated how engaged and disengaged those callers were while interacting with the automated agent. We observe how the crowd's ratings compare to callers' self-ratings of engagement, and further study how the distribution of these rating assignments varies as a function of whether the automated system or the caller was speaking. Finally, we discuss the potential applications and pitfalls of such crowdsourced paradigms in designing, developing and analyzing engagement-aware dialog systems.
@InProceedings{ICMI17p281,
author = {Vikram Ramanarayanan and Chee Wee Leong and David Suendermann-Oeft and Keelan Evanini},
title = {Crowdsourcing Ratings of Caller Engagement in Thin-Slice Videos of Human-Machine Dialog: Benefits and Pitfalls},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {281--287},
doi = {},
year = {2017},
}
Modelling Fusion of Modalities in Multimodal Interactive Systems with MMMM
Bruno Dumas, Jonathan Pirau, and Denis Lalanne
(University of Namur, Belgium; University of Fribourg, Switzerland)
Several models and design spaces have been defined and are regularly used to describe how modalities can be fused together in an interactive multimodal system. However, models such as CASE, the CARE properties or TYCOON were all defined more than two decades ago. In this paper, we start with a critical review of these models, which notably highlights a confusion between how the user side and the system side of a multimodal system are described. Based on this critical review, we define MMMM v1, an improved model for the description of multimodal fusion in interactive systems that targets completeness. A first user evaluation comparing the models revealed that MMMM v1 was indeed complete, but at the cost of user friendliness. Based on the results of this first evaluation, an improved version of MMMM, called MMMM v2, was defined. A second user evaluation highlighted that this model achieves a good balance between complexity, consistency and completeness compared to the state of the art.
@InProceedings{ICMI17p288,
author = {Bruno Dumas and Jonathan Pirau and Denis Lalanne},
title = {Modelling Fusion of Modalities in Multimodal Interactive Systems with MMMM},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {288--296},
doi = {},
year = {2017},
}
Temporal Alignment using the Incremental Unit Framework
Casey Kennington, Ting Han, and David Schlangen
(Boise State University, USA; Bielefeld University, Germany)
We propose a method for temporal alignment, a precondition of meaningful fusion, in multimodal systems, using the incremental unit dialogue system framework. The framework gives the system flexibility in how it handles alignment: either by delaying a modality for a specified amount of time, or by revoking (i.e., backtracking) processed information so that multiple information sources can be processed jointly. We evaluate our approach in an offline experiment with multimodal data and find that the incremental framework is flexible and shows promise as a solution to the problem of temporal alignment in multimodal systems.
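The delay-versus-revoke trade-off described above can be illustrated with a small sketch (not code from the paper; the incremental unit framework provides its own abstractions). All class names, parameters, and the example payloads here are hypothetical:

from collections import deque

class AlignmentBuffer:
    """Toy incremental-unit aligner: 'delay' holds units back for a fixed lag
    so slower modalities can catch up; 'revoke' commits immediately but can
    retract units when late information arrives. Units are (timestamp, payload)."""
    def __init__(self, strategy="delay", lag=0.3):
        self.strategy, self.lag = strategy, lag
        self.pending, self.committed = deque(), []

    def add(self, unit, now):
        self.pending.append(unit)
        return self.flush(now)

    def flush(self, now):
        out = []
        while self.pending and (self.strategy == "revoke"
                                or now - self.pending[0][0] >= self.lag):
            out.append(self.pending.popleft())
        self.committed.extend(out)
        return out

    def revoke_after(self, t):
        # Retract committed units newer than t so they can be re-fused
        # jointly with the late-arriving modality.
        revoked = [u for u in self.committed if u[0] > t]
        self.committed = [u for u in self.committed if u[0] <= t]
        self.pending.extendleft(reversed(revoked))
        return revoked

buf = AlignmentBuffer(strategy="delay", lag=0.3)
buf.add((0.00, "gesture: point-left"), now=0.05)   # held back, lag not yet elapsed
print(buf.flush(now=0.40))                          # emitted once the lag has passed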
@InProceedings{ICMI17p297,
author = {Casey Kennington and Ting Han and David Schlangen},
title = {Temporal Alignment using the Incremental Unit Framework},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {297--301},
doi = {},
year = {2017},
}
Multimodal Gender Detection
Mohamed Abouelenien, Verónica Pérez-Rosas, Rada Mihalcea, and Mihai Burzo
(University of Michigan, USA)
Automatic gender classification is receiving increasing attention in the computer interaction community as the need for personalized, reliable, and ethical systems arises. To date, most gender classification systems have been evaluated on textual and audiovisual sources. This work explores the possibility of enhancing such systems with physiological cues obtained from thermography and physiological sensor readings. Using a multimodal dataset consisting of audiovisual, thermal, and physiological recordings of males and females, we extract features from five different modalities, namely acoustic, linguistic, visual, thermal, and physiological. We then conduct a set of experiments where we explore the gender prediction task using single and combined modalities. Experimental results suggest that physiological and thermal information can be used to recognize gender at reasonable accuracy levels, which are comparable to the accuracy of current gender prediction systems. Furthermore, we show that the use of non-contact physiological measurements, such as thermography readings, can enhance current systems that are based on audio or visual input. This can be particularly useful for scenarios where non-contact approaches are preferred, i.e., when data is captured under noisy audiovisual conditions or when video or speech data are not available due to ethical considerations.
@InProceedings{ICMI17p302,
author = {Mohamed Abouelenien and Verónica Pérez-Rosas and Rada Mihalcea and Mihai Burzo},
title = {Multimodal Gender Detection},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {302--311},
doi = {},
year = {2017},
}
How May I Help You? Behavior and Impressions in Hospitality Service Encounters
Skanda Muralidhar, Marianne Schmid Mast, and Daniel Gatica-Perez
(Idiap, Switzerland; EPFL, Switzerland; University of Lausanne, Switzerland)
In the service industry, customers often assess quality of service based on the behavior, perceived personality, and other attributes of the front line service employees they interact with. Interpersonal communication during these interactions is key to determine customer satisfaction and perceived service quality. We present a computational framework to automatically infer perceived performance and skill variables of employees interacting with customers in a hotel reception desk setting using nonverbal behavior, studying a dataset of 169 dyadic interactions involving students from a hospitality management school. We also study the connections between impressions of Big-5 personality traits, attractiveness, and performance of receptionists. In regression tasks, our automatic framework achieves R2=0.30 for performance impressions using audio-visual nonverbal cues, compared to 0.35 using personality impressions, while attractiveness impressions had low predictive power. We also study the integration of nonverbal behavior and Big-5 personality impressions towards increasing regression performance (R2 = 0.37).
@InProceedings{ICMI17p312,
author = {Skanda Muralidhar and Marianne Schmid Mast and Daniel Gatica-Perez},
title = {How May I Help You? Behavior and Impressions in Hospitality Service Encounters},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {312--320},
doi = {},
year = {2017},
}
Tracking Liking State in Brain Activity while Watching Multiple Movies
Naoto Terasawa, Hiroki Tanaka, Sakriani Sakti, and Satoshi Nakamura
(NAIST, Japan)
Emotion is valuable information in applications ranging from human-computer interaction to automated multimedia content delivery. Conventional methods for recognizing emotion rely on speech prosody cues, facial expression, and body language. However, this information may not be available when people watch a movie. In recent years, some studies have started to use electroencephalogram (EEG) signals to recognize emotion, but the EEG data were analyzed over entire movie scenes for emotion classification, so detailed information about emotional state changes could not be extracted. In this study, we use EEG to track affective state while watching multiple movies. In our experiments, continuous liking state was measured while participants watched three types of movies, and a subject-dependent emotional state tracking model was then constructed. We used a support vector machine (SVM) as a classifier and support vector regression (SVR) for regression. The best classification accuracy was 77.6%, and the best regression model achieved a correlation coefficient of 0.645 between actual and predicted liking state. These results demonstrate that continuous emotional state can be predicted by our EEG-based method.
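For readers who want to see the classification/regression setup in concrete form, a minimal scikit-learn sketch follows; the feature dimensions, window counts, and train/test split are placeholders, not values from the study:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

# X: per-window EEG features (n_windows x n_features), hypothetical shapes
X = np.random.randn(200, 32)
y_class = np.random.randint(0, 2, size=200)   # liking vs. not liking per window
y_cont = np.random.rand(200)                  # continuous liking rating per window

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X[:150], y_class[:150])
print("classification accuracy:", clf.score(X[150:], y_class[150:]))

reg = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
reg.fit(X[:150], y_cont[:150])
pred = reg.predict(X[150:])
print("correlation:", np.corrcoef(pred, y_cont[150:])[0, 1])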
@InProceedings{ICMI17p321,
author = {Naoto Terasawa and Hiroki Tanaka and Sakriani Sakti and Satoshi Nakamura},
title = {Tracking Liking State in Brain Activity while Watching Multiple Movies},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {321--325},
doi = {},
year = {2017},
}
Does Serial Memory of Locations Benefit from Spatially Congruent Audiovisual Stimuli? Investigating the Effect of Adding Spatial Sound to Visuospatial Sequences
Benjamin Stahl and Georgios Marentakis
(Graz University of Technology, Austria)
We present a study that investigates whether multimodal audiovisual presentation improves the recall of visuospatial sequences. In a modified Corsi block-tapping experiment, participants were asked to serially recall a spatial stimulus sequence in conditions that manipulated stimulus modality and sequence length. Adding spatial sound to the visuospatial sequences did not improve serial spatial recall performance. The results support the hypothesis that no modality-specific components in spatial working memory exist.
@InProceedings{ICMI17p326,
author = {Benjamin Stahl and Georgios Marentakis},
title = {Does Serial Memory of Locations Benefit from Spatially Congruent Audiovisual Stimuli? Investigating the Effect of Adding Spatial Sound to Visuospatial Sequences},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {326--330},
doi = {},
year = {2017},
}
ZSGL: Zero Shot Gestural Learning
Naveen Madapana and Juan Wachs
(Purdue University, USA)
Gesture recognition systems enable humans to interact with machines in an intuitive and natural way. Humans tend to create gestures on the fly, and conventional systems lack the adaptability to learn new gestures beyond the training stage. This problem can best be addressed using Zero Shot Learning (ZSL), a paradigm in machine learning that aims to recognize unseen objects from a description of them alone. ZSL for gestures has hardly been addressed in computer vision research due to the inherent ambiguity and contextual dependency associated with gestures. This work proposes an approach for Zero Shot Gestural Learning (ZSGL) by leveraging the semantic information that is embedded in gestures. First, a human-factors-based approach is followed to generate semantic descriptors for gestures that generalize to the existing gesture classes. Second, we assess the performance of various existing state-of-the-art algorithms on ZSL for gestures using two standard datasets: MSRC-12 and CGD2011. The obtained results (26.35% unseen-class accuracy) parallel the benchmark accuracies of attribute-based object recognition and justify our claim that ZSL is a desirable paradigm for gesture-based systems.
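The attribute-based zero-shot pipeline the abstract alludes to can be sketched roughly as follows (an illustration under assumed data shapes, not the authors' method): map gesture features to the semantic descriptor space, then assign the nearest unseen-class descriptor:

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: skeletal features and binary semantic descriptors.
X_seen = np.random.randn(500, 60)             # features of seen-class gesture samples
A_seen = np.random.randint(0, 2, (500, 12))   # their semantic descriptors
A_unseen = np.random.randint(0, 2, (4, 12))   # descriptors of 4 unseen classes

# Learn a mapping from gesture features to the descriptor (attribute) space.
attr_model = Ridge(alpha=1.0).fit(X_seen, A_seen)

def predict_unseen(x):
    a_hat = attr_model.predict(x.reshape(1, -1))        # predicted descriptor vector
    dists = np.linalg.norm(A_unseen - a_hat, axis=1)    # compare to unseen prototypes
    return int(np.argmin(dists))                        # nearest unseen class index

print(predict_unseen(np.random.randn(60)))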
@InProceedings{ICMI17p331,
author = {Naveen Madapana and Juan Wachs},
title = {ZSGL: Zero Shot Gestural Learning},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {331--335},
doi = {},
year = {2017},
}
Markov Reward Models for Analyzing Group Interaction
Gabriel Murray
(University of the Fraser Valley, Canada)
In this work we introduce a novel application of Markov Reward models for studying group interaction. We describe a sample state representation for social sequences in meetings, and give examples of how particular states can be associated with immediate positive or negative rewards, based on outcomes of interest. We then present a Value Iteration algorithm for estimating the values of states. While we focus on two specific applications of Markov Reward models to small group interaction in meetings, there are many ways in which such a model can be used to study different facets of group dynamics and interaction. To encourage such research, we are making the Value Iteration software freely available.
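A compact value-iteration sketch for a Markov reward chain of the kind described might look like this (the three states, rewards, and discount factor are hypothetical; the authors' released software is the reference implementation):

import numpy as np

# Hypothetical 3-state social-sequence model: transition matrix P,
# immediate rewards R (e.g. +1 for a state tied to a positive outcome),
# and discount factor gamma.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
R = np.array([0.0, 1.0, -1.0])
gamma = 0.9

V = np.zeros(3)
for _ in range(1000):
    V_new = R + gamma * P @ V        # Bellman backup for a fixed Markov reward chain
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)   # estimated long-run value of starting in each state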
@InProceedings{ICMI17p336,
author = {Gabriel Murray},
title = {Markov Reward Models for Analyzing Group Interaction},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {336--340},
doi = {},
year = {2017},
}
Analyzing First Impressions of Warmth and Competence from Observable Nonverbal Cues in Expert-Novice Interactions
Beatrice Biancardi, Angelo Cafaro, and Catherine Pelachaud
(CNRS, France; UPMC, France)
In this paper we present an analysis of a corpus of dyadic expert-novice knowledge-sharing interactions. The analysis investigates the relationship between observed non-verbal cues and the formation of first impressions of warmth and competence. We first obtained both discrete and continuous annotations of our data. Discrete descriptors include non-verbal cues such as type of gesture, arm rest poses, head movements and smiles. Continuous descriptors concern annotators' judgments of the expert's warmth and competence during the observed interaction with the novice. We then computed odds ratios between these descriptors. Results highlight the role of smiling in warmth and competence impressions. Smiling is associated with increased levels of warmth and decreased levels of competence. It also affects the impact of other non-verbal cues (e.g., self-adaptor gestures) on warmth and competence. Moreover, our findings provide interesting insights into the role of rest poses, which are associated with decreased levels of warmth and competence impressions.
@InProceedings{ICMI17p341,
author = {Beatrice Biancardi and Angelo Cafaro and Catherine Pelachaud},
title = {Analyzing First Impressions of Warmth and Competence from Observable Nonverbal Cues in Expert-Novice Interactions},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {341--349},
doi = {},
year = {2017},
}
The NoXi Database: Multimodal Recordings of Mediated Novice-Expert Interactions
Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth André, and Michel Valstar
(CNRS, France; UPMC, France; University of Augsburg, Germany; University of Nottingham, UK)
We present a novel multi-lingual database of natural dyadic novice-expert interactions, named NoXi, featuring screen-mediated dyadic human interactions in the context of information exchange and retrieval. NoXi is designed to provide spontaneous interactions with emphasis on adaptive behaviors and unexpected situations (e.g. conversational interruptions). A rich set of audio-visual data, as well as continuous and discrete annotations are publicly available through a web interface. Descriptors include low level social signals (e.g. gestures, smiles), functional descriptors (e.g. turn-taking, dialogue acts) and interaction descriptors (e.g. engagement, interest, and fluidity).
@InProceedings{ICMI17p350,
author = {Angelo Cafaro and Johannes Wagner and Tobias Baur and Soumia Dermouche and Mercedes Torres Torres and Catherine Pelachaud and Elisabeth André and Michel Valstar},
title = {The NoXi Database: Multimodal Recordings of Mediated Novice-Expert Interactions},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {350--359},
doi = {},
year = {2017},
}
Head-Mounted Displays as Opera Glasses: Using Mixed-Reality to Deliver an Egalitarian User Experience during Live Events
Carl Bishop, Augusto Esteves, and Iain McGregor
(Edinburgh Napier University, UK)
This paper explores the use of head-mounted displays (HMDs) as a way to deliver a front-row experience to any audience member during a live event. To do so, it presents a two-part user study that compares participants' reported sense of presence across three experimental conditions: front row, back row, and back row with HMD (displaying 360° video captured live from the front row). Data was collected using the Temple Presence Inventory (TPI), which measures presence across eight factors. The reported sense of presence in the HMD condition was significantly higher in five of these measures, including spatial presence, social presence, passive social presence, active social presence, and social richness. We argue that the non-significant differences found in the other three factors (engagement, social realism, and perceptual realism) are artefacts of participants' personal taste for the song being performed, or of the effects of using a mixed-reality approach. Finally, the paper describes a basic system for low-latency, 360° video live streaming using off-the-shelf, affordable equipment and software.
@InProceedings{ICMI17p360,
author = {Carl Bishop and Augusto Esteves and Iain McGregor},
title = {Head-Mounted Displays as Opera Glasses: Using Mixed-Reality to Deliver an Egalitarian User Experience during Live Events},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {360--364},
doi = {},
year = {2017},
}
Poster Session 2
Analyzing Gaze Behavior during Turn-Taking for Estimating Empathy Skill Level
Ryo Ishii, Shiro Kumano, and Kazuhiro Otsuka
(NTT, Japan)
Techniques that use nonverbal behaviors to estimate communication skill in discussions have been receiving a lot of attention in recent research. In this study, we explored gaze behavior towards the end of an utterance during turn-keeping/changing to estimate empathy skills in multiparty discussions. First, we collected data on Davis' Interpersonal Reactivity Index (IRI) (which measures empathy skill), utterances, and gaze behavior from participants in four-person discussions. The results of the analysis showed that gaze behavior during turn-keeping/changing differs in accordance with people's empathy skill levels. The most noteworthy result is that a person's level of empathy skill is inversely proportional to the frequency of eye contact with the conversational partner during turn-keeping/changing. Specifically, if the current speaker has a high skill level, she often does not look at the listener during turn-keeping and turn-changing. Moreover, when a person with a high skill level is the next speaker, she does not look at the speaker during turn-changing. In contrast, people who have a low skill level often continue to make eye contact with speakers and listeners. On the basis of these findings, we constructed and evaluated four models for estimating empathy skill levels. The evaluation results showed that the average absolute error of estimation is only 0.22 for the gaze transition pattern (GTP) model. This model uses the occurrence probabilities of GTPs when the person is a speaker or listener during turn-keeping, and the speaker, next speaker, or a listener during turn-changing. It outperformed the models that used the amount of utterances and duration of gazes. This suggests that the GTP during turn-keeping and turn-changing is effective for estimating an individual's empathy skills in multi-party discussions.
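To make the GTP features more concrete, a toy sketch is given below; the gaze-target labels and the two-step patterns are illustrative, not the paper's actual coding scheme:

from collections import Counter

def gtp_features(gaze_sequence, patterns):
    """gaze_sequence: list of gaze targets per frame around an utterance end,
    e.g. ['listener', 'listener', 'away', 'next_speaker', ...].
    Returns the occurrence probability of each two-step transition pattern."""
    transitions = list(zip(gaze_sequence, gaze_sequence[1:]))
    counts = Counter(transitions)
    total = max(len(transitions), 1)
    return [counts[p] / total for p in patterns]

patterns = [('listener', 'away'), ('away', 'listener'),
            ('listener', 'next_speaker'), ('next_speaker', 'listener')]
seq = ['listener', 'away', 'listener', 'next_speaker', 'listener']
print(gtp_features(seq, patterns))   # feature vector a regressor could consume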
@InProceedings{ICMI17p365,
author = {Ryo Ishii and Shiro Kumano and Kazuhiro Otsuka},
title = {Analyzing Gaze Behavior during Turn-Taking for Estimating Empathy Skill Level},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {365--373},
doi = {},
year = {2017},
}
Text Based User Comments as a Signal for Automatic Language Identification of Online Videos
A. Seza Doğruöz, Natalia Ponomareva, Sertan Girgin, Reshu Jain, and Christoph Oehler
(Xoogler, Turkey; Google, USA; Google, France; Google, Switzerland)
Identifying the audio language of online videos is crucial for industrial multi-media applications. Automatic speech recognition systems can potentially detect the language of the audio. However, such systems are not available for all languages. Moreover, background noise, music and multi-party conversations make audio language identification hard. Instead, we utilize text based user comments as a new signal to identify audio language of YouTube videos. First, we detect the language of the text based comments. Augmenting this information with video meta-data features, we predict the language of the videos with an accuracy of 97% on a set of publicly available videos. The subject matter discussed in this research is patent pending.
@InProceedings{ICMI17p374,
author = {A. Seza Doğruöz and Natalia Ponomareva and Sertan Girgin and Reshu Jain and Christoph Oehler},
title = {Text Based User Comments as a Signal for Automatic Language Identification of Online Videos},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {374--378},
doi = {},
year = {2017},
}
Gender and Emotion Recognition with Implicit User Signals
Maneesh Bilalpur, Seyed Mostafa Kia, Manisha Chawla, Tat-Seng Chua, and Ramanathan Subramanian
(IIIT Hyderabad, India; Radboud University, Netherlands; IIT Gandhinagar, India; National University of Singapore, Singapore; University of Glasgow, UK; Advanced Digital Sciences Center, Singapore)
We examine the utility of implicit user behavioral signals captured using low-cost, off-the-shelf devices for anonymous gender and emotion recognition. A user study designed to examine male and female sensitivity to facial emotions confirms that females recognize (especially negative) emotions more quickly and accurately than males, mirroring prior findings. Implicit viewer responses in the form of EEG brain signals and eye movements are then examined for the existence of (a) emotion- and gender-specific patterns in event-related potentials (ERPs) and fixation distributions and (b) emotion and gender discriminability. Experiments reveal that (i) gender- and emotion-specific differences are observable from ERPs, (ii) multiple similarities exist between explicit responses gathered from users and their implicit behavioral signals, and (iii) significantly above-chance (≈70%) gender recognition is achievable by comparing emotion-specific EEG responses, with gender differences encoded best for anger and disgust. Also, fairly modest valence (positive vs. negative emotion) recognition is achieved with EEG and eye-based features.
@InProceedings{ICMI17p379,
author = {Maneesh Bilalpur and Seyed Mostafa Kia and Manisha Chawla and Tat-Seng Chua and Ramanathan Subramanian},
title = {Gender and Emotion Recognition with Implicit User Signals},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {379--387},
doi = {},
year = {2017},
}
Animating the Adelino Robot with ERIK: The Expressive Robotics Inverse Kinematics
Tiago Ribeiro and Ana Paiva
(INESC-ID, Portugal; University of Lisbon, Portugal)
This paper presents ERIK, an inverse kinematics technique for robot animation that is able to control a real expressive manipulator-like robot in real-time, by simultaneously solving for both the robot's full body expressive posture, and orientation of the end-effector (head). Our solution, meant for generic autonomous social robots, was designed to work out of the box on any kinematic chain, and achieved by blending forward kinematics (for posture control) and inverse kinematics (for the gaze-tracking/orientation constraint). Results from a user study show that the resulting expressive motion is still able to convey an expressive intention to the users.
@InProceedings{ICMI17p388,
author = {Tiago Ribeiro and Ana Paiva},
title = {Animating the Adelino Robot with ERIK: The Expressive Robotics Inverse Kinematics},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {388--396},
doi = {},
year = {2017},
}
Automatic Detection of Pain from Spontaneous Facial Expressions
Fatma Meawad, Su-Yin Yang, and Fong Ling Loy
(University of Glasgow, UK; Tan Tock Seng Hospital, Singapore)
This paper presents a new approach for detecting pain in sequences of spontaneous facial expressions. The motivation for this work is to accompany mobile-based self-management of chronic pain as a virtual sensor for tracking patients' expressions in real-world settings. Operating under such constraints requires a resource-efficient approach for processing non-posed facial expressions from unprocessed temporal data. In this work, the facial action units of pain are modeled as sets of distances among related facial landmarks. Using standardized measurements of pain versus no-pain that are specific to each user, changes in the extracted features in relation to pain are detected. The activated features in each frame are combined using an adapted form of the Prkachin and Solomon Pain Intensity scale (PSPI) to detect the presence of pain per frame. Painful features must be activated in N consecutive frames (time window) to indicate the presence of pain in a session. The discussed method was tested on 171 video sessions for 19 subjects from the McMaster painful dataset for spontaneous facial expressions. The results show higher precision than coverage in detecting sequences of pain. Our algorithm achieves 94% precision (F-score=0.82) against human observed labels, 74% precision (F-score=0.62) against automatically generated pain intensities and 100% precision (F-score=0.67) against self-reported pain intensities.
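A hedged sketch of the per-frame scoring and N-frame windowing follows; the feature count, threshold, and window length are placeholders, and the paper's adapted PSPI weighting is not reproduced exactly:

import numpy as np

def frame_pain_score(distances, baseline, threshold=1.5):
    """Count pain-related landmark-distance features whose relative deviation
    from the user's neutral baseline exceeds a per-feature threshold."""
    deviation = np.abs(distances - baseline) / (np.abs(baseline) + 1e-6)
    return int(np.sum(deviation > threshold))

def detect_pain(session_frames, baseline, n_window=10, min_score=2):
    """Flag pain in a session if at least `min_score` features are activated
    in `n_window` consecutive frames."""
    flags = [frame_pain_score(f, baseline) >= min_score for f in session_frames]
    run = 0
    for f in flags:
        run = run + 1 if f else 0
        if run >= n_window:
            return True
    return False

baseline = np.ones(8)                                   # neutral distances (hypothetical)
frames = [np.ones(8) + np.random.rand(8) * 3 for _ in range(50)]
print(detect_pain(frames, baseline))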
@InProceedings{ICMI17p397,
author = {Fatma Meawad and Su-Yin Yang and Fong Ling Loy},
title = {Automatic Detection of Pain from Spontaneous Facial Expressions},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {397--401},
doi = {},
year = {2017},
}
Evaluating Content-Centric vs. User-Centric Ad Affect Recognition
Abhinav Shukla, Shruti Shriya Gullapuram, Harish Katti, Karthik Yadati, Mohan Kankanhalli, and Ramanathan Subramanian
(IIIT Hyderabad, India; Indian Institute of Science, India; Delft University of Technology, Netherlands; National University of Singapore, Singapore; University of Glasgow at Singapore, Singapore)
Despite the fact that advertisements (ads) often include strongly emotional content, very little work has been devoted to affect recognition (AR) from ads. This work explicitly compares content-centric and user-centric ad AR methodologies, and evaluates the impact of enhanced AR on computational advertising via a user study. Specifically, we (1) compile an affective ad dataset capable of evoking coherent emotions across users; (2) explore the efficacy of content-centric convolutional neural network (CNN) features for encoding emotions, and show that CNN features outperform low-level emotion descriptors; (3) examine user-centered ad AR by analyzing Electroencephalogram (EEG) responses acquired from eleven viewers, and find that EEG signals encode emotional information better than content descriptors; (4) investigate the relationship between objective AR and subjective viewer experience while watching an ad-embedded online video stream based on a study involving 12 users. To our knowledge, this is the first work to (a) expressly compare user vs content-centered AR for ads, and (b) study the relationship between modeling of ad emotions and its impact on a real-life advertising application.
@InProceedings{ICMI17p402,
author = {Abhinav Shukla and Shruti Shriya Gullapuram and Harish Katti and Karthik Yadati and Mohan Kankanhalli and Ramanathan Subramanian},
title = {Evaluating Content-Centric vs. User-Centric Ad Affect Recognition},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {402--410},
doi = {},
year = {2017},
}
A Domain Adaptation Approach to Improve Speaker Turn Embedding using Face Representation
Nam Le and Jean-Marc Odobez
(Idiap, Switzerland)
This paper proposes a novel approach to improve speaker modeling using knowledge transferred from face representation. In particular, we are interested in learning a discriminative metric which allows speaker turns to be compared directly, which is beneficial for tasks such as diarization and dialogue analysis. Our method improves the embedding space of speaker turns by applying maximum mean discrepancy loss to minimize the disparity between the distributions of facial and acoustic embedded features. This approach aims to discover the shared underlying structure of the two embedded spaces, thus enabling the transfer of knowledge from the richer face representation to the counterpart in speech. Experiments are conducted on broadcast TV news datasets, REPERE and ETAPE, to demonstrate the validity of our method. Quantitative results in verification and clustering tasks show promising improvement, especially in cases where speaker turns are short or the training data size is limited.
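Because the method hinges on a maximum mean discrepancy (MMD) term between face and speaker-turn embeddings, here is a small sketch of a standard RBF-kernel MMD estimate (kernel choice, bandwidth, and batch sizes are assumptions, not the paper's settings):

import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Pairwise squared distances, then Gaussian kernel values.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between sample sets X (face embeddings)
    and Y (speaker-turn embeddings)."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2 * rbf_kernel(X, Y, sigma).mean())

face = np.random.randn(128, 64)     # hypothetical embedding batches
voice = np.random.randn(128, 64)
print(mmd2(face, voice))            # a loss of this form is minimized to align the spaces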
@InProceedings{ICMI17p411,
author = {Nam Le and Jean-Marc Odobez},
title = {A Domain Adaptation Approach to Improve Speaker Turn Embedding using Face Representation},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {411--415},
doi = {},
year = {2017},
}
Computer Vision Based Fall Detection by a Convolutional Neural Network
Miao Yu, Liyun Gong, and Stefanos Kollias
(University of Lincoln, UK)
In this work, we propose a novel computer-vision-based fall detection system that could be applied to health care for the elderly. For a recorded video stream, background subtraction is first applied to extract the human body silhouette. Extracted silhouettes corresponding to daily activities are then used to train a convolutional neural network, which classifies different human postures (e.g., bend, stand, lie and sit) and detects fall events (i.e., a lying posture detected in the floor region). As far as we know, this work is the first attempt to apply a convolutional neural network to fall detection. On a dataset of daily activities recorded from multiple people, we show that the proposed method achieves higher posture classification accuracy than state-of-the-art classifiers and successfully detects fall events with a low false alarm rate.
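A minimal posture-classification CNN in the spirit of the description is sketched below (PyTorch, with a hypothetical 64x64 silhouette input and four posture classes; this is not the authors' architecture):

import torch
import torch.nn as nn

class PostureCNN(nn.Module):
    """Classifies silhouette images into postures (bend, stand, lie, sit);
    a fall would then be declared when 'lie' is predicted inside the floor region."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = PostureCNN()
silhouettes = torch.rand(8, 1, 64, 64)       # batch of binary silhouette masks
logits = model(silhouettes)
print(logits.argmax(dim=1))                  # predicted posture per frame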
@InProceedings{ICMI17p416,
author = {Miao Yu and Liyun Gong and Stefanos Kollias},
title = {Computer Vision Based Fall Detection by a Convolutional Neural Network},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {416--420},
doi = {},
year = {2017},
}
Predicting Meeting Extracts in Group Discussions using Multimodal Convolutional Neural Networks
Fumio Nihei, Yukiko I. Nakano, and Yutaka Takase
(Seikei University, Japan)
This study proposes multimodal fusion models employing convolutional neural networks (CNNs) to extract meeting minutes from a group discussion corpus. First, unimodal models are created from raw behavioral data such as speech, head motion, and face tracking. These models are then integrated into a fusion model that works as a classifier. The main advantage of this work is that the proposed models were trained without any hand-crafted features, and they outperformed a baseline model that was trained using hand-crafted features. It was also found that multimodal fusion is useful in applying the CNN approach to model multimodal multiparty interaction.
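A rough late-fusion sketch of this kind of architecture is shown below (PyTorch; the per-modality branches, channel counts, sequence length, and fusion layer are assumptions, not the paper's design):

import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """1-D CNN over one raw behavioral stream (speech, head motion, or face tracking)."""
    def __init__(self, in_channels, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 16, 5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, out_dim))

    def forward(self, x):
        return self.net(x)

class FusionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.speech = ModalityBranch(1)
        self.head = ModalityBranch(3)     # e.g. x/y/z head motion
        self.face = ModalityBranch(6)     # e.g. a few tracked face parameters
        self.out = nn.Linear(32 * 3, 2)   # extract-worthy utterance vs. not

    def forward(self, s, h, f):
        z = torch.cat([self.speech(s), self.head(h), self.face(f)], dim=1)
        return self.out(z)

model = FusionClassifier()
logits = model(torch.rand(4, 1, 200), torch.rand(4, 3, 200), torch.rand(4, 6, 200))
print(logits.shape)   # (4, 2)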
@InProceedings{ICMI17p421,
author = {Fumio Nihei and Yukiko I. Nakano and Yutaka Takase},
title = {Predicting Meeting Extracts in Group Discussions using Multimodal Convolutional Neural Networks},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {421--425},
doi = {},
year = {2017},
}
The Relationship between Task-Induced Stress, Vocal Changes, and Physiological State during a Dyadic Team Task
Catherine Neubauer, Mathieu Chollet, Sharon Mozgai, Mark Dennison, Peter Khooshabeh, and Stefan Scherer
(Army Research Lab at Playa Vista, USA; University of Southern California, USA)
It is commonly known that a relationship exists between the human voice and various emotional states. Past studies have demonstrated changes in a number of vocal features, such as fundamental frequency f0 and peakSlope, as a result of varying emotional state. These voice characteristics have been shown to relate to emotional load, vocal tension, and, in particular, stress. Although much research exists in the domain of voice analysis, few studies have assessed the relationship between stress and changes in the voice during a dyadic team interaction. The aim of the present study was to investigate the multimodal interplay between speech and physiology during a high-workload, high-stress team task. Specifically, we studied task-induced effects on participants' vocal signals, namely the f0 and peakSlope features, as well as participants' physiology, through cardiovascular measures. Further, we assessed the relationship between physiological states related to stress and changes in the speaker's voice. We recruited participants with the specific goal of working together to defuse a simulated bomb. Half of our sample participated in an "Ice Breaker" scenario, during which they were allowed to converse and familiarize themselves with their teammate prior to the task, while the other half of the sample served as our "Control". Fundamental frequency (f0), peakSlope, physiological state, and subjective stress were measured during the task. Results indicated that f0 and peakSlope significantly increased from the beginning to the end of each task trial, and were highest in the last trial, which indicates an increase in emotional load and vocal tension. Finally, cardiovascular measures of stress indicated that the vocal and emotional load of speakers towards the end of the task mirrored a physiological state of psychological "threat".
@InProceedings{ICMI17p426,
author = {Catherine Neubauer and Mathieu Chollet and Sharon Mozgai and Mark Dennison and Peter Khooshabeh and Stefan Scherer},
title = {The Relationship between Task-Induced Stress, Vocal Changes, and Physiological State during a Dyadic Team Task},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {426--432},
doi = {},
year = {2017},
}
Meyendtris: A Hands-Free, Multimodal Tetris Clone using Eye Tracking and Passive BCI for Intuitive Neuroadaptive Gaming
Laurens R. Krol, Sarah-Christin Freytag, and Thorsten O. Zander
(TU Berlin, Germany)
This paper introduces a completely hands-free version of Tetris that uses eye tracking and passive brain-computer interfacing (a real-time measurement and interpretation of brain activity) to replace existing game elements, as well as to introduce novel ones. In Meyendtris, dwell-time-based eye tracking replaces the game's direct control elements, i.e. the movement of the tetromino. In addition, two mental states of the player influence the game in real time by means of passive brain-computer interfacing. First, a measure of the player's relaxation is used to modulate the speed of the game (and the corresponding music). Second, when a state of error perception is detected in the player's brain upon the landing of a tetromino, that last-landed tetromino is destroyed. Together, this results in a multimodal, hands-free version of the classic Tetris game that is no longer hindered by manual input bottlenecks, while engaging novel mental abilities of the player.
@InProceedings{ICMI17p433,
author = {Laurens R. Krol and Sarah-Christin Freytag and Thorsten O. Zander},
title = {Meyendtris: A Hands-Free, Multimodal Tetris Clone using Eye Tracking and Passive BCI for Intuitive Neuroadaptive Gaming},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {433--437},
doi = {},
year = {2017},
}
AMHUSE: A Multimodal dataset for HUmour SEnsing
Giuseppe Boccignone, Donatello Conte, Vittorio Cuculo, and Raffaella Lanzarotti
(University of Milan, Italy; University of Tours, France)
We present AMHUSE (A Multimodal dataset for HUmour SEnsing) along with a novel web-based annotation tool named DANTE (Dimensional ANnotation Tool for Emotions). The dataset is the result of an amusement elicitation experiment involving 36 subjects, whose reactions were recorded in the presence of 3 amusing and 1 neutral video stimuli. Gathered data include RGB video and depth sequences along with physiological responses (electrodermal activity, blood volume pulse, temperature). The videos were later annotated by 4 experts in terms of the continuous dimensions of valence and arousal. Both the dataset and the annotation tool are made publicly available for research purposes.
@InProceedings{ICMI17p438,
author = {Giuseppe Boccignone and Donatello Conte and Vittorio Cuculo and Raffaella Lanzarotti},
title = {AMHUSE: A Multimodal dataset for HUmour SEnsing},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {438--445},
doi = {},
year = {2017},
}
GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication
Mohamed Khamis, Mariam Hassib, Emanuel von Zezschwitz, Andreas Bulling, and Florian Alt
(LMU Munich, Germany; Max Planck Institute for Informatics, Germany)
Although mobile devices provide access to a plethora of sensitive data, most users still only protect them with PINs or patterns, which are vulnerable to side-channel attacks (e.g., shoulder surfing). However, prior research has shown that privacy-aware users are willing to take further steps to protect their private data. We propose GazeTouchPIN, a novel secure authentication scheme for mobile devices that combines gaze and touch input. Our multimodal approach complicates shoulder-surfing attacks by requiring attackers to observe the screen as well as the user's eyes to find the password. We evaluate the security and usability of GazeTouchPIN in two user studies (N=30). We found that while GazeTouchPIN requires longer entry times, privacy-aware users would use it on demand when feeling observed or when accessing sensitive data. The results show that the successful shoulder surfing attack rate drops from 68% to 10.4% when using GazeTouchPIN.
@InProceedings{ICMI17p446,
author = {Mohamed Khamis and Mariam Hassib and Emanuel von Zezschwitz and Andreas Bulling and Florian Alt},
title = {GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {446--450},
doi = {},
year = {2017},
}
Multi-task Learning of Social Psychology Assessments and Nonverbal Features for Automatic Leadership Identification
Cigdem Beyan, Francesca Capozzi, Cristina Becchio, and Vittorio Murino
(IIT Genoa, Italy; McGill University, Canada; University of Turin, Italy; University of Verona, Italy)
In social psychology, leadership is investigated using questionnaires that are either i) self-administered, ii) applied to group participants to evaluate other members, or iii) filled in by external observers. While each of these sources is informative, using them individually might not be as effective as using them jointly. This paper is the first attempt to address the automatic identification of leaders in small-group meetings by learning effective models from nonverbal audio-visual features and the results of social psychology questionnaires that reflect assessments of leadership. Learning is based on Multi-Task Learning, performed without ground-truth data (GT); instead, the results of questionnaires (which have substantial agreement with the GT), administered to external observers and to the participants of the meetings, serve as tasks. The results show that joint learning performs better than single-task learning and other baselines.
@InProceedings{ICMI17p451,
author = {Cigdem Beyan and Francesca Capozzi and Cristina Becchio and Vittorio Murino},
title = {Multi-task Learning of Social Psychology Assessments and Nonverbal Features for Automatic Leadership Identification},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {451--455},
doi = {},
year = {2017},
}
Multimodal Analysis of Vocal Collaborative Search: A Public Corpus and Results
Daniel McDuff, Paul Thomas, Mary Czerwinski, and Nick Craswell
(Microsoft Research, USA; Microsoft Research, Australia)
Intelligent agents have the potential to help with many tasks. Information seeking and voice-enabled search assistants are becoming very common. However, there remain questions as to the extent to which these agents should sense and respond to emotional signals. We designed a set of information-seeking tasks and recruited participants to complete them using a human intermediary. In total we collected data from 22 pairs of individuals, each completing five search tasks. The participants could communicate only using voice, over a VoIP service. Using automated methods we extracted facial action, voice prosody and linguistic features from the audio-visual recordings. We analyzed the characteristics of these interactions that correlated with successful communication and understanding between the pairs. We found that those who were expressive in channels that were missing from the communication medium (e.g., facial actions and gaze) were rated as communicating more poorly and as being less helpful and less understanding. Having a way of reinstating nonverbal cues into these interactions would improve the experience, even when the tasks are purely information-seeking exercises. The dataset used for this analysis contains over 15 hours of video, audio and transcripts and reported ratings. It is publicly available for researchers at: http://aka.ms/MISCv1.
@InProceedings{ICMI17p456,
author = {Daniel McDuff and Paul Thomas and Mary Czerwinski and Nick Craswell},
title = {Multimodal Analysis of Vocal Collaborative Search: A Public Corpus and Results},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {456--463},
doi = {},
year = {2017},
}
UE-HRI: A New Dataset for the Study of User Engagement in Spontaneous Human-Robot Interactions
Atef Ben-Youssef, Chloé Clavel, Slim Essid, Miriam Bilac, Marine Chamoux, and Angelica Lim
(Telecom ParisTech, France; University of Paris-Saclay, France; SoftBank Robotics, France)
In this paper, we present a new dataset of spontaneous interactions between a robot and humans, of which 54 interactions (each between 4 and 15 minutes in duration) are freely available for download and use. Participants were recorded while holding spontaneous conversations with the robot Pepper. A conversation started automatically when the robot detected the presence of a participant, and the recording was kept if he/she agreed to be recorded. Pepper was in a public space where participants were free to start and end the interaction whenever they wished. The dataset provides rich streams of data that could be used by research and development groups in a variety of areas.
@InProceedings{ICMI17p464,
author = {Atef Ben-Youssef and Chloé Clavel and Slim Essid and Miriam Bilac and Marine Chamoux and Angelica Lim},
title = {UE-HRI: A New Dataset for the Study of User Engagement in Spontaneous Human-Robot Interactions},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {464--472},
doi = {},
year = {2017},
}
Mining a Multimodal Corpus of Doctor’s Training for Virtual Patient’s Feedbacks
Chris Porhet, Magalie Ochs, Jorane Saubesty, Grégoire de Montcheuil, and Roxane Bertrand
(Aix-Marseille University, France; CNRS, France; ENSAM, France; University of Toulon, France)
Doctors should be trained not only to perform medical or surgical acts but also to develop communication skills for their interactions with patients. For instance, the way doctors deliver bad news has a significant impact on the therapeutic process. In order to facilitate doctors' training in breaking bad news, we aim at developing a virtual patient able to interact in a multimodal way with doctors announcing an undesirable event. One of the key elements to create an engaging interaction is the feedback behavior of the virtual character. In order to model the virtual patient's feedback in the context of breaking bad news, we have analyzed a corpus of real doctor training sessions. The verbal and nonverbal signals of both the doctors and the patients have been annotated. In order to identify the types of feedback and the elements that may elicit a feedback, we have explored the corpus using sequence mining methods. Rules extracted from the corpus enable us to determine when a virtual patient should express which feedback as a doctor announces bad news.
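To give a flavor of the rule-extraction step, the toy sketch below counts which doctor signals are most often followed by a patient feedback within a short window and keeps frequent pairs as candidate rules; the labels, window, and support threshold are illustrative, and the authors use dedicated sequence mining techniques on a richer annotation scheme:

from collections import Counter

# Hypothetical annotated timeline: (time_in_seconds, speaker, label)
events = [(1.0, "doctor", "pause"), (1.4, "patient", "head_nod"),
          (3.2, "doctor", "gaze_at_patient"), (3.5, "patient", "mm-hmm"),
          (6.0, "doctor", "pause"), (6.3, "patient", "head_nod")]

def mine_rules(events, window=1.0, min_support=2):
    """Count doctor-signal -> patient-feedback pairs occurring within `window`
    seconds and keep those seen at least `min_support` times."""
    pairs = Counter()
    for i, (t, who, label) in enumerate(events):
        if who != "doctor":
            continue
        for t2, who2, label2 in events[i + 1:]:
            if t2 - t > window:
                break
            if who2 == "patient":
                pairs[(label, label2)] += 1
    return [(cue, fb, n) for (cue, fb), n in pairs.items() if n >= min_support]

print(mine_rules(events))   # e.g. [('pause', 'head_nod', 2)]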
@InProceedings{ICMI17p473,
author = {Chris Porhet and Magalie Ochs and Jorane Saubesty and Grégoire de Montcheuil and Roxane Bertrand},
title = {Mining a Multimodal Corpus of Doctor’s Training for Virtual Patient’s Feedbacks},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {473--478},
doi = {},
year = {2017},
}
Multimodal Affect Recognition in an Interactive Gaming Environment using Eye Tracking and Speech Signals
Ashwaq Alhargan, Neil Cooke, and Tareq Binjammaz
(University of Birmingham, UK; De Montfort University, UK)
This paper presents a multimodal affect recognition system for interactive virtual gaming environments using eye tracking and speech signals, captured in gameplay scenarios that are designed to elicit controlled affective states based on the arousal and valence dimensions. The Support Vector Machine is employed as a classifier to detect these affective states from both modalities. The recognition results reveal that eye tracking is superior to speech in affect detection and that the two modalities are complementary in the application of interactive gaming. This suggests that it is feasible to design an accurate multimodal recognition system to detect players’ affects from the eye tracking and speech modalities in the interactive gaming environment. We emphasise the potential of integrating the proposed multimodal system into game interfaces to enhance interaction and provide an adaptive gaming experience.
@InProceedings{ICMI17p479,
author = {Ashwaq Alhargan and Neil Cooke and Tareq Binjammaz},
title = {Multimodal Affect Recognition in an Interactive Gaming Environment using Eye Tracking and Speech Signals},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {479--486},
doi = {},
year = {2017},
}