18th ACM International Conference on Multimodal Interaction (ICMI 2016),
November 12–16, 2016,
Tokyo, Japan
Main Track
Oral Session 1: Multimodal Social Agents
Sun, Nov 13, 10:45 - 12:25, Miraikan Hall (Chair: Elisabeth Andre (Augsburg University))
Trust Me: Multimodal Signals of Trustworthiness
Gale Lucas, Giota Stratou, Shari Lieblich, and Jonathan Gratch
(University of Southern California, USA; Temple College, USA)
This paper builds on prior psychological studies that identify signals of trustworthiness between two human negotiators. Unlike prior work, the current work tracks such signals automatically and fuses them into computational models that predict trustworthiness. To achieve this goal, we apply automatic trackers to recordings of human dyads negotiating in a multi-issue bargaining task. We identify behavioral indicators in different modalities (facial expressions, gestures, gaze, and conversational features) that are predictive of trustworthiness. We predict both objective trustworthiness (i.e., are they honest) and perceived trustworthiness (i.e., do they seem honest to their interaction partner). Our experiments show that people are poor judges of objective trustworthiness (i.e., objective and perceived trustworthiness are predicted by different indicators), and that multimodal approaches better predict objective trustworthiness, whereas people overly rely on facial expressions when judging the honesty of their partner. Moreover, domain knowledge (from the literature and prior analysis of behaviors) facilitates the model development process.
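A minimal illustrative sketch of the kind of pipeline described above (not the authors' implementation): per-modality features are concatenated (feature-level fusion) and separate classifiers are trained for objective and perceived trustworthiness, plus a face-only baseline. All names, dimensions and data below are hypothetical toy stand-ins, using Python with scikit-learn.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120                                    # negotiators (toy data)
face = rng.normal(size=(n, 10))            # facial-expression features
gesture = rng.normal(size=(n, 6))          # gesture features
gaze = rng.normal(size=(n, 4))             # gaze features
conv = rng.normal(size=(n, 8))             # conversational features
X_multimodal = np.hstack([face, gesture, gaze, conv])   # feature-level fusion

y_objective = rng.integers(0, 2, size=n)   # were they actually honest?
y_perceived = rng.integers(0, 2, size=n)   # did their partner think they were honest?

for name, y in [("objective", y_objective), ("perceived", y_perceived)]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), X_multimodal, y, cv=5).mean()
    print(f"{name} trustworthiness, multimodal features: {acc:.2f}")

# Face-only baseline, mirroring the finding that observers over-rely on facial cues.
acc_face = cross_val_score(LogisticRegression(max_iter=1000), face, y_perceived, cv=5).mean()
print(f"perceived trustworthiness, face only: {acc_face:.2f}")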
@InProceedings{ICMI16p5,
author = {Gale Lucas and Giota Stratou and Shari Lieblich and Jonathan Gratch},
title = {Trust Me: Multimodal Signals of Trustworthiness},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {5--12},
doi = {},
year = {2016},
}
Semi-situated Learning of Verbal and Nonverbal Content for Repeated Human-Robot Interaction
Iolanda Leite, André Pereira, Allison Funkhouser, Boyang Li, and Jill Fain Lehman
(Disney Research, USA)
Content authoring of verbal and nonverbal behavior is a limiting factor when developing agents for repeated social interactions with the same user. We present PIP, an agent that crowdsources its own multimodal language behavior using a method we call semi-situated learning. PIP renders segments of its goal graph into brief stories that describe future situations, sends the stories to crowd workers who author and edit a single line of character dialog and its manner of expression, integrates the results into its goal state representation, and then uses the authored lines at similar moments in conversation. We present an initial case study in which the language needed to host a trivia game interaction is learned pre-deployment and tested in an autonomous system with 200 users "in the wild." The interaction data suggests that the method generates both meaningful content and variety of expression.
@InProceedings{ICMI16p13,
author = {Iolanda Leite and André Pereira and Allison Funkhouser and Boyang Li and Jill Fain Lehman},
title = {Semi-situated Learning of Verbal and Nonverbal Content for Repeated Human-Robot Interaction},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {13--20},
doi = {},
year = {2016},
}
Towards Building an Attentive Artificial Listener: On the Perception of Attentiveness in Audio-Visual Feedback Tokens
Catharine Oertel, José Lopes, Yu Yu, Kenneth A. Funes Mora, Joakim Gustafson, Alan W. Black, and Jean-Marc Odobez
(KTH, Sweden; EPFL, Switzerland; Idiap, Switzerland; Carnegie Mellon University, USA)
Current dialogue systems typically lack variation in their audio-visual feedback tokens: either they do not produce feedback tokens at all, or they support only a limited set of stereotypical functions, which does not mirror the subtleties of spontaneous conversation. If we want to build an artificial listener, as a first step towards an empathetic artificial agent, we also need to be able to synthesize more subtle audio-visual feedback tokens. In this study, we devised an array of monomodal and multimodal binary-comparison perception tests and experiments to understand how different realisations of verbal and visual feedback tokens influence third-party perception of the degree of attentiveness. This allowed us to investigate i) which features of the visual feedback (amplitude, frequency, duration, etc.) influence the perception of attentiveness; ii) whether visual or verbal backchannels are perceived as more attentive; iii) whether fusing unimodal tokens with low perceived attentiveness increases perceived attentiveness compared to unimodal tokens with high perceived attentiveness taken alone; and iv) how audio-visual feedback tokens can be ranked automatically in terms of the degree of attentiveness they convey.
@InProceedings{ICMI16p21,
author = {Catharine Oertel and José Lopes and Yu Yu and Kenneth A. Funes Mora and Joakim Gustafson and Alan W. Black and Jean-Marc Odobez},
title = {Towards Building an Attentive Artificial Listener: On the Perception of Attentiveness in Audio-Visual Feedback Tokens},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {21--28},
doi = {},
year = {2016},
}
Sequence-Based Multimodal Behavior Modeling for Social Agents
Soumia Dermouche and Catherine Pelachaud
(CNRS, France; Telecom ParisTech, France)
The goal of this work is to model a virtual character able to converse with different interpersonal attitudes. To build our model, we rely on the analysis of multimodal corpora of non-verbal behaviors. The interpretation of these behaviors depends on how they are sequenced (order) and distributed over time. To encompass the dynamics of non-verbal signals across both modalities and time, we make use of temporal sequence mining. Specifically, we propose a new algorithm for temporal sequence extraction. We apply our algorithm to extract temporal patterns of non-verbal behaviors expressing interpersonal attitudes from a corpus of job interviews. We demonstrate the efficiency of our algorithm in terms of significant accuracy improvement over the state-of-the-art algorithms.
@InProceedings{ICMI16p29,
author = {Soumia Dermouche and Catherine Pelachaud},
title = {Sequence-Based Multimodal Behavior Modeling for Social Agents},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {29--36},
doi = {},
year = {2016},
}
Oral Session 2: Physiological and Tactile Modalities
Sun, Nov 13, 14:00 - 15:30, Miraikan Hall (Chair: Jonathan Gratch (University of Southern California))
Adaptive Review for Mobile MOOC Learning via Implicit Physiological Signal Sensing
Phuong Pham and Jingtao Wang
(University of Pittsburgh, USA)
Massive Open Online Courses (MOOCs) have the potential to enable high-quality knowledge dissemination at large scale and low cost. However, today's MOOCs also suffer from low engagement, uni-directional information flow, and a lack of personalization. In this paper, we propose AttentiveReview, an effective intervention technology for mobile MOOC learning. AttentiveReview infers a learner's perceived difficulty levels of the corresponding learning materials via implicit photoplethysmography (PPG) sensing on unmodified smartphones. AttentiveReview also recommends personalized review sessions through a user-independent model. In a 32-participant user study, we found that: 1) AttentiveReview significantly improved information recall (+14.6%) and learning gain (+17.4%) compared with the no-review condition; 2) AttentiveReview achieved comparable performance in significantly less time compared with the full-review condition; 3) as an end-to-end mobile tutoring system, the benefits of AttentiveReview outweigh the side effects of false positives and false negatives. Overall, we show that it is feasible to improve mobile MOOC learning by adaptively recommending review materials from rich but noisy physiological signals.
@InProceedings{ICMI16p37,
author = {Phuong Pham and Jingtao Wang},
title = {Adaptive Review for Mobile MOOC Learning via Implicit Physiological Signal Sensing},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {37--44},
doi = {},
year = {2016},
}
Visuotactile Integration for Depth Perception in Augmented Reality
Nina Rosa, Wolfgang Hürst, Peter Werkhoven, and Remco Veltkamp
(Utrecht University, Netherlands)
Augmented reality applications using stereo head-mounted displays are not capable of perfectly blending real and virtual objects. For example, depth in the real world is perceived through cues such as accommodation and vergence. However, in stereo head-mounted displays these cues are disconnected, since the virtual content is generally projected at a static distance while vergence changes with depth. This conflict can result in biased depth estimation of virtual objects in a real environment. In this research, we examined whether redundant tactile feedback can reduce the bias in perceived depth in a reaching task. In particular, our experiments showed that a tactile mapping of distance to vibration intensity or vibration position on the skin can be used to determine a virtual object's depth. Depth estimation when using only tactile feedback was more accurate than when using only visual feedback, and when using visuotactile feedback it was more precise and occurred faster than when using unimodal feedback. Our work demonstrates the value of multimodal feedback in augmented reality applications that require correct depth perception, and provides insights on various possible visuotactile implementations.
@InProceedings{ICMI16p45,
author = {Nina Rosa and Wolfgang Hürst and Peter Werkhoven and Remco Veltkamp},
title = {Visuotactile Integration for Depth Perception in Augmented Reality},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {45--52},
doi = {},
year = {2016},
}
Exploring Multimodal Biosignal Features for Stress Detection during Indoor Mobility
Kyriaki Kalimeri and Charalampos Saitis
(ISI Foundation, Italy; TU Berlin, Germany)
This paper presents a multimodal framework for assessing the emotional and cognitive experience of blind and visually impaired people when navigating in unfamiliar indoor environments based on mobile monitoring and fusion of electroencephalography (EEG) and electrodermal activity (EDA) signals. The overall goal is to understand which environmental factors increase stress and cognitive load in order to help design emotionally intelligent mobility technologies that are able to adapt to stressful environments from real-time biosensor data. We propose a model based on a random forest classifier which successfully infers in an automatic way (weighted AUROC 79.3%) the correct environment among five predefined categories expressing generic everyday situations of varying complexity and difficulty, where different levels of stress are likely to occur. Time-locating the most predictive multimodal features that relate to cognitive load and stress, we provide further insights into the relationship of specific biomarkers with the environmental/situational factors that evoked them.
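As a rough, hypothetical sketch of the classification setup described above (not the authors' code), the snippet below trains a random forest on fused EEG/EDA features and evaluates it with the weighted one-vs-rest AUROC figure of merit quoted in the abstract; the data, dimensions and split are toy stand-ins, using Python with scikit-learn.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 24))        # fused EEG + EDA features per time window (toy data)
y = rng.integers(0, 5, size=500)      # five predefined environment categories

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)

# Weighted one-vs-rest AUROC over the five environment classes.
proba = clf.predict_proba(X_te)
print(roc_auc_score(y_te, proba, multi_class="ovr", average="weighted"))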
@InProceedings{ICMI16p53,
author = {Kyriaki Kalimeri and Charalampos Saitis},
title = {Exploring Multimodal Biosignal Features for Stress Detection during Indoor Mobility},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {53--60},
doi = {},
year = {2016},
}
An IDE for Multimodal Controls in Smart Buildings
Sebastian Peters, Jan Ole Johanssen, and Bernd Bruegge
(TU Munich, Germany)
Smart buildings have become an essential part of our daily life. Due to the increased availability of addressable fixtures and controls, the operation of building components has become complex for occupants, especially when multimodal controls are involved. Thus, occupants are confronted with the need to manage these controls in a user-friendly way.
We propose the MIBO IDE, an integrated development environment for defining, connecting, and managing multimodal controls designed with end-users in mind. The MIBO IDE provides a convenient way for occupants to create and customize multimodal interaction models while taking the occupants' preferences, culture, and potential physical limitations into account.
The MIBO IDE allows occupants to apply and utilize various components such as the MIBO Editor. This core component can be used to define interaction models visually using the drag-and-drop metaphor and therefore without requiring programming skills. In addition, the MIBO IDE provides components for debugging and compiling interaction models as well as detecting conflicts among them.
@InProceedings{ICMI16p61,
author = {Sebastian Peters and Jan Ole Johanssen and Bernd Bruegge},
title = {An IDE for Multimodal Controls in Smart Buildings},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {61--65},
doi = {},
year = {2016},
}
Poster Session 1
Sun, Nov 13, 16:00 - 18:00, Conference Room 3
Personalized Unknown Word Detection in Non-native Language Reading using Eye Gaze
Rui Hiraoka, Hiroki Tanaka, Sakriani Sakti, Graham Neubig, and Satoshi Nakamura
(Nara Institute of Science and Technology, Japan)
This paper proposes a method to detect unknown words during natural reading of non-native language text by using eye-tracking features. A previous approach utilizes gaze duration and word rarity features to perform this detection. However, while this system can be used by trained users, its performance is not sufficient during natural reading by untrained users. In this paper, we 1) apply support vector machines (SVM) with novel eye movement features that were not considered in the previous work and 2) examine the effect of personalization. The experimental results demonstrate that learning using SVMs and proposed eye movement features improves detection performance as measured by F-measure and that personalization further improves results.
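A minimal sketch of an SVM detector of the kind described above (not the authors' implementation): per-word gaze and rarity features feed a classifier evaluated by F-measure. Features, data and dimensions below are hypothetical toy stand-ins, in Python with scikit-learn; personalization would train or adapt such a model per reader rather than pooling all readers.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# One row per (reader, word): gaze duration, word rarity, plus hypothetical
# eye-movement features such as fixation count and regression (re-reading) count.
X = rng.normal(size=(2000, 5))
y = rng.integers(0, 2, size=2000)     # 1 = word unknown to this reader

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
print("F-measure:", cross_val_score(model, X, y, cv=5, scoring="f1").mean())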
@InProceedings{ICMI16p66,
author = {Rui Hiraoka and Hiroki Tanaka and Sakriani Sakti and Graham Neubig and Satoshi Nakamura},
title = {Personalized Unknown Word Detection in Non-native Language Reading using Eye Gaze},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {66--70},
doi = {},
year = {2016},
}
Discovering Facial Expressions for States of Amused, Persuaded, Informed, Sentimental and Inspired
Daniel McDuff
(Affectiva, USA; Microsoft Research, USA)
Facial expressions play a significant role in everyday interactions. A majority of the research on facial expressions of emotion has focused on a small set of "basic" states. However, in real-life the expression of emotions is highly context dependent and prototypic expressions of "basic" emotions may not always be present. In this paper we attempt to discover expressions associated with alternate states of informed, inspired, persuaded, sentimental and amused based on a very large dataset of observed facial responses. We used a curated set of 395 everyday videos that were found to reliably elicit the states and recorded 49,869 facial responses as viewers watched the videos in their homes. Using automated facial coding we quantified the presence of 18 facial actions in each of the 23.4 million frames. Lip corner pulls, lip sucks and inner brow raises were prominent in sentimental responses. Outer brow raises and eye widening were prominent in persuaded and informed responses. More brow furrowing distinguished informed from persuaded responses potentially indicating higher cognition.
@InProceedings{ICMI16p71,
author = {Daniel McDuff},
title = {Discovering Facial Expressions for States of Amused, Persuaded, Informed, Sentimental and Inspired},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {71--75},
doi = {},
year = {2016},
}
Do Speech Features for Detecting Cognitive Load Depend on Specific Languages?
Rui Chen, Tiantian Xie, Yingtao Xie, Tao Lin, and Ningjiu Tang
(Sichuan University, China; University of Maryland, Baltimore County, USA)
Speech-based cognitive load modeling techniques recently proposed for English have enabled objective, quantitative, and unobtrusive evaluation of cognitive load without extra equipment. However, no evidence indicates that these techniques can be applied to speech data in other languages without modification. In this study, a modified Stroop Test and a Reading Span Task were conducted to collect speech data in English and Chinese respectively, from which twenty non-linguistic features were extracted to investigate whether they are language dependent. Some discriminating speech features were observed to be language dependent, which serves as evidence that speech-based cognitive load detection techniques need to be adapted to diverse language contexts for higher performance.
@InProceedings{ICMI16p76,
author = {Rui Chen and Tiantian Xie and Yingtao Xie and Tao Lin and Ningjiu Tang},
title = {Do Speech Features for Detecting Cognitive Load Depend on Specific Languages?},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {76--83},
doi = {},
year = {2016},
}
Training on the Job: Behavioral Analysis of Job Interviews in Hospitality
Skanda Muralidhar, Laurent Son Nguyen, Denise Frauendorfer, Jean-Marc Odobez, Marianne Schmid Mast, and Daniel Gatica-Perez
(Idiap, Switzerland; EPFL, Switzerland; University of Lausanne, Switzerland)
First impressions play a critical role in the hospitality industry and have been shown to be closely linked to the behavior of the person being judged. In this work, we implemented a behavioral training framework for hospitality students with the goal of improving the impressions that other people form of them. We outline the challenges associated with designing such a framework and embedding it in the everyday practice of a real hospitality school. We collected a dataset of 169 laboratory sessions in which two role-plays were conducted, job interviews and reception desk scenarios, for a total of 338 interactions. For job interviews, we evaluated the relationship between automatically extracted nonverbal cues and various perceived social variables in a correlation analysis. Furthermore, our system automatically predicted first impressions from job interviews in a regression task and was able to explain up to 32% of the variance, thus extending results in the existing literature, and showed gender differences, corroborating previous findings in psychology. This work constitutes a step towards applying social sensing technologies to the real world by designing and implementing a living lab for students of an international hospitality management school.
@InProceedings{ICMI16p84,
author = {Skanda Muralidhar and Laurent Son Nguyen and Denise Frauendorfer and Jean-Marc Odobez and Marianne Schmid Mast and Daniel Gatica-Perez},
title = {Training on the Job: Behavioral Analysis of Job Interviews in Hospitality},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {84--91},
doi = {},
year = {2016},
}
Emotion Spotting: Discovering Regions of Evidence in Audio-Visual Emotion Expressions
Yelin Kim and Emily Mower Provost
(SUNY Albany, USA; University of Michigan, USA)
Research has demonstrated that humans require different amounts of information, over time, to accurately perceive emotion expressions. This varies as a function of emotion classes. For example, recognition of happiness requires a longer stimulus than recognition of anger. However, previous automatic emotion recognition systems have often overlooked these differences. In this work, we propose a data-driven framework to explore patterns (timings and durations) of emotion evidence, specific to individual emotion classes. Further, we demonstrate that these patterns vary as a function of which modality (lower face, upper face, or speech) is examined, and consistent patterns emerge across different folds of experiments. We also show similar patterns across emotional corpora (IEMOCAP and MSP-IMPROV). In addition, we show that our proposed method, which uses only a portion of the data (59% for the IEMOCAP), achieves comparable accuracy to a system that uses all of the data within each utterance. Our method has a higher accuracy when compared to a baseline method that randomly chooses a portion of the data. We show that the performance gain of the method is mostly from prototypical emotion expressions (defined as expressions with rater consensus). The innovation in this study comes from its understanding of how multimodal cues reveal emotion over time.
@InProceedings{ICMI16p92,
author = {Yelin Kim and Emily Mower Provost},
title = {Emotion Spotting: Discovering Regions of Evidence in Audio-Visual Emotion Expressions},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {92--99},
doi = {},
year = {2016},
}
Semi-supervised Model Personalization for Improved Detection of Learner's Emotional Engagement
Nese Alyuz, Eda Okur, Ece Oktay, Utku Genc, Sinem Aslan, Sinem Emine Mete, Bert Arnrich, and Asli Arslan Esme
(Intel, Turkey; Bogazici University, Turkey)
Affective states play a crucial role in learning. Existing Intelligent Tutoring Systems (ITSs) fail to track the affective states of learners accurately. Without an accurate detection of such states, ITSs are limited in providing a truly personalized learning experience. In our longitudinal research, we have been working towards developing an empathic autonomous 'tutor' that closely monitors students in real time using multiple sources of data to understand their affective states corresponding to emotional engagement. We focus on detecting learning-related states (i.e., 'Satisfied', 'Bored', and 'Confused'). We have collected 210 hours of data through authentic classroom pilots of 17 sessions. We collected information from two modalities: (1) appearance, which is captured by the camera, and (2) context-performance, which is derived from the content platform. The learning content of the platform consists of two section types: (1) instructional, where students watch instructional videos, and (2) assessment, where students solve exercise questions. Since there are individual differences in expressing affective states, the detection of emotional engagement needs to be customized for each individual. In this paper, we propose a hierarchical semi-supervised model adaptation method to achieve highly accurate emotional engagement detectors. In the initial calibration phase, a personalized context-performance classifier is obtained. In the online usage phase, the appearance classifier is automatically personalized using the labels generated by the context-performance model. The experimental results show that personalization improves the performance of our generic emotional engagement detectors. The proposed semi-supervised hierarchical personalization method results in F1 measures of 89.23% and 75.20% for the instructional and assessment sections, respectively.
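A simplified, hypothetical sketch of the semi-supervised personalization idea described above (not the authors' implementation): a calibrated context-performance classifier produces pseudo-labels that personalize an appearance classifier. Classifier choice, feature dimensions and data below are toy assumptions, in Python with scikit-learn.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

# Calibration phase: a personalized context-performance classifier is trained
# from labeled platform logs for one student (toy stand-in data).
X_context_calib = rng.normal(size=(200, 8))
y_calib = rng.integers(0, 3, size=200)            # 0=Satisfied, 1=Bored, 2=Confused
context_clf = RandomForestClassifier(random_state=3).fit(X_context_calib, y_calib)

# Online phase: the context-performance model labels new, unlabeled moments,
# and those labels are used to personalize the appearance (camera-based) classifier.
X_context_online = rng.normal(size=(500, 8))
X_appearance_online = rng.normal(size=(500, 20))
pseudo_labels = context_clf.predict(X_context_online)

appearance_clf = RandomForestClassifier(random_state=3)
appearance_clf.fit(X_appearance_online, pseudo_labels)   # personalized engagement detector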
@InProceedings{ICMI16p100,
author = {Nese Alyuz and Eda Okur and Ece Oktay and Utku Genc and Sinem Aslan and Sinem Emine Mete and Bert Arnrich and Asli Arslan Esme},
title = {Semi-supervised Model Personalization for Improved Detection of Learner's Emotional Engagement},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {100--107},
doi = {},
year = {2016},
}
Driving Maneuver Prediction using Car Sensor and Driver Physiological Signals
Nanxiang Li, Teruhisa Misu, Ashish Tawari, Alexandre Miranda, Chihiro Suga, and Kikuo Fujimura
(Honda Research Institute, USA)
This study presents a preliminary attempt to use driver physiological signals, including electrocardiography (ECG) and respiration wave signals, to predict driving maneuvers. While most studies on driving maneuver prediction use direct measurements from the vehicle or the road scene, we believe that the changes in a driver's mental state while planning a maneuver are reflected in the physiological signals. We extract both time- and frequency-domain features from the physiological signals and use them to predict the driver's future maneuver. We formulate driver maneuver prediction as a multi-class classification problem using features extracted from the signals before the driving maneuvers, where the classes correspond to maneuver types including Start, Stop, Lane Switch and Turn. We use a support vector machine (SVM) as the classifier and compare the performance of using both physiological and car (CAN bus) signals with a baseline classifier trained on car signals only. We observe an improvement of 0.04 in F-score on average when using the physiological features, and the improvement is more pronounced when the prediction is made earlier.
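A hypothetical sketch of the feature extraction and multi-class SVM setup described above (not the authors' code): simple time-domain statistics and a dominant-frequency feature are computed per pre-maneuver window and fed to an SVM. Sampling rate, window length and data are illustrative assumptions, in Python with NumPy and scikit-learn.

import numpy as np
from sklearn.svm import SVC

def physio_features(window, fs=100.0):
    """Time- and frequency-domain features from one pre-maneuver signal window."""
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    dominant = 1 + spectrum[1:].argmax()              # skip the DC component
    return np.array([window.mean(), window.std(), spectrum[dominant], freqs[dominant]])

rng = np.random.default_rng(4)
windows = rng.normal(size=(300, 1000))   # ECG/respiration windows before each maneuver (toy)
X = np.array([physio_features(w) for w in windows])
y = rng.integers(0, 4, size=300)         # 0=Start, 1=Stop, 2=Lane Switch, 3=Turn

clf = SVC(kernel="rbf").fit(X, y)        # multi-class prediction of the upcoming maneuver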
@InProceedings{ICMI16p108,
author = {Nanxiang Li and Teruhisa Misu and Ashish Tawari and Alexandre Miranda and Chihiro Suga and Kikuo Fujimura},
title = {Driving Maneuver Prediction using Car Sensor and Driver Physiological Signals},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {108--112},
doi = {},
year = {2016},
}
On Leveraging Crowdsourced Data for Automatic Perceived Stress Detection
Jonathan Aigrain, Arnaud Dapogny, Kévin Bailly, Séverine Dubuisson, Marcin Detyniecki, and Mohamed Chetouani
(UPMC, France; CNRS, France; LIP6, France; Polish Academy of Sciences, Poland)
Resorting to crowdsourcing platforms is a popular way to obtain annotations. Multiple potentially noisy answers can thus be aggregated to retrieve an underlying ground truth. However, it may be irrelevant to look for a unique ground truth when we ask crowd workers for opinions, notably when dealing with subjective phenomena such as stress. In this paper, we discuss how we can better use crowdsourced annotations with an application to automatic detection of perceived stress. Towards this aim, we first acquired video data from 44 subjects in a stressful situation and gathered answers to a binary question using a crowdsourcing platform. Then, we propose to integrate two measures derived from the set of gathered answers into the machine learning framework. First, we highlight that using the consensus level among crowd worker answers substantially increases classification accuracies. Then, we show that it is suitable to directly predict for each video the proportion of positive answers to the question from the different crowd workers. Hence, we propose a thorough study on how crowdsourced annotations can be used to enhance performance of classification and regression methods.
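To make the two proposed uses of crowd answers concrete, here is a minimal hypothetical sketch (not the authors' implementation): consensus level is used as a sample weight for classification, and the proportion of positive answers is regressed directly. Data, thresholds and model choices are toy assumptions, in Python with scikit-learn.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVC

rng = np.random.default_rng(5)
n_videos = 220
X = rng.normal(size=(n_videos, 30))            # behavioural features per video (toy data)
votes = rng.integers(3, 11, size=n_videos)     # number of crowd answers per video
positives = rng.integers(0, votes + 1)         # "this person looks stressed" answers

p_stress = positives / votes                   # proportion of positive answers
y_binary = (p_stress >= 0.5).astype(int)       # majority-vote label
consensus = np.abs(p_stress - 0.5) * 2         # 0 = split jury, 1 = unanimous

# Option 1: weight training samples by consensus level when classifying.
SVC().fit(X, y_binary, sample_weight=consensus)

# Option 2: regress directly on the proportion of positive answers.
Ridge().fit(X, p_stress)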
@InProceedings{ICMI16p113,
author = {Jonathan Aigrain and Arnaud Dapogny and Kévin Bailly and Séverine Dubuisson and Marcin Detyniecki and Mohamed Chetouani},
title = {On Leveraging Crowdsourced Data for Automatic Perceived Stress Detection},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {113--120},
doi = {},
year = {2016},
}
Investigating the Impact of Automated Transcripts on Non-native Speakers' Listening Comprehension
Xun Cao, Naomi Yamashita, and Toru Ishida
(Kyoto University, Japan; NTT, Japan)
Real-time transcripts generated by automatic speech recognition (ASR) technologies hold potential to facilitate non-native speakers’ (NNSs) listening comprehension. While introducing another modality (i.e., ASR transcripts) to NNSs provides supplemental information to understand speech, it also runs the risk of overwhelming them with excessive information. The aim of this paper is to understand the advantages and disadvantages of presenting ASR transcripts to NNSs and to study how such transcripts affect listening experiences. To explore these issues, we conducted a laboratory experiment with 20 NNSs who engaged in two listening tasks in different conditions: audio only and audio+ASR transcripts. In each condition, the participants described the comprehension problems they encountered while listening. From the analysis, we found that ASR transcripts helped NNSs solve certain problems (e.g., “do not recognize words they know”), but imperfect ASR transcripts (e.g., errors and no punctuation) sometimes confused them and even generated new problems. Furthermore, post-task interviews and gaze analysis of the participants revealed that NNSs did not have enough time to fully exploit the transcripts. For example, NNSs had difficulty shifting between multimodal contents. Based on our findings, we discuss the implications for designing better multimodal interfaces for NNSs.
@InProceedings{ICMI16p121,
author = {Xun Cao and Naomi Yamashita and Toru Ishida},
title = {Investigating the Impact of Automated Transcripts on Non-native Speakers' Listening Comprehension},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {121--128},
doi = {},
year = {2016},
}
Speaker Impact on Audience Comprehension for Academic Presentations
Keith Curtis, Gareth J. F. Jones, and Nick Campbell
(Dublin City University, Ireland; Trinity College Dublin, Ireland)
Understanding audience comprehension levels for presentations has the potential to enable richer and more focused interaction with audio-visual recordings. We describe an investigation into the automated analysis and classification of audience comprehension during academic presentations. We identify the audio and visual features considered to most aid audience understanding. To obtain gold standards for comprehension levels, human annotators watched contiguous video segments from a corpus of academic presentations and estimated how much of the content they understood or comprehended. We investigate pre-fusion and post-fusion strategies over a number of input streams and demonstrate the most effective modalities for classification of comprehension. We demonstrate that it is possible to build a classifier to predict potential audience comprehension levels, obtaining an accuracy of 52.9% over a 7-class range and 85.4% on a binary classification problem.
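The contrast between pre-fusion and post-fusion can be illustrated with a small hypothetical sketch (not the authors' code): early fusion concatenates audio and visual features before one classifier, late fusion averages per-modality probabilities. Data, dimensions and classifier choice are toy assumptions, in Python with scikit-learn.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X_audio = rng.normal(size=(400, 12))   # audio features per video segment (toy data)
X_video = rng.normal(size=(400, 16))   # visual features per video segment
y = rng.integers(0, 2, size=400)       # binary comprehension label from annotators

# Pre-fusion (early): concatenate the streams and train a single classifier.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# Post-fusion (late): train one classifier per modality, then average their probabilities.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
late_proba = (clf_a.predict_proba(X_audio) + clf_v.predict_proba(X_video)) / 2
late_pred = late_proba.argmax(axis=1)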
@InProceedings{ICMI16p129,
author = {Keith Curtis and Gareth J. F. Jones and Nick Campbell},
title = {Speaker Impact on Audience Comprehension for Academic Presentations},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {129--136},
doi = {},
year = {2016},
}
EmoReact: A Multimodal Approach and Dataset for Recognizing Emotional Responses in Children
Behnaz Nojavanasghari, Tadas Baltrušaitis, Charles E. Hughes, and Louis-Philippe Morency
(University of Central Florida, USA; Carnegie Mellon University, USA)
Automatic emotion recognition plays a central role in the technologies underlying social robots, affect-sensitive human computer interaction design and affect-aware tutors. Although there has been a considerable amount of research on automatic emotion recognition in adults, emotion recognition in children has been understudied. This problem is more challenging as children tend to fidget and move around more than adults, leading to more self-occlusions and non-frontal head poses. Also, the lack of publicly available datasets for children with annotated emotion labels leads most researchers to focus on adults. In this paper, we introduce a newly collected multimodal emotion dataset of children between the ages of four and fourteen years old. The dataset contains 1102 audio-visual clips annotated for 17 different emotional states: six basic emotions, neutral, valence and nine complex emotions including curiosity, uncertainty and frustration. Our experiments compare unimodal and multimodal emotion recognition baseline models to enable future research on this topic. Finally, we present a detailed analysis of the most indicative behavioral cues for emotion recognition in children.
@InProceedings{ICMI16p137,
author = {Behnaz Nojavanasghari and Tadas Baltrušaitis and Charles E. Hughes and Louis-Philippe Morency},
title = {EmoReact: A Multimodal Approach and Dataset for Recognizing Emotional Responses in Children},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {137--144},
doi = {},
year = {2016},
}
Bimanual Input for Multiscale Navigation with Pressure and Touch Gestures
Sebastien Pelurson and Laurence Nigay
(University of Grenoble, France; LIG, France; CNRS, France)
We explore the combination of touch modalities with pressure-based modalities for multiscale navigation in bifocal views. We investigate a two-hand mobile configuration in which: 1) the dominant hand is kept free for precise touch interaction at any scale of a bifocal view, and 2) the non-dominant hand is used for holding the device in landscape mode, keeping the thumb free for pressure input for navigation at the context scale. The pressure sensor is fixed to the front bezel. Our investigation of pressure-based modalities involves two design options: control (continuous or discrete) and inertia (with or without). The pressure-based modalities are compared to touch-only modalities: the well-known drag-flick and drag-drop modalities. The results show that the continuous pressure-based modality without inertia 1) is the fastest, along with the drag-drop touch modality, 2) is preferred by users, and 3) importantly, minimizes screen occlusion during a phase that requires navigating a large part of the information space.
@InProceedings{ICMI16p145,
author = {Sebastien Pelurson and Laurence Nigay},
title = {Bimanual Input for Multiscale Navigation with Pressure and Touch Gestures},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {145--152},
doi = {},
year = {2016},
}
Intervention-Free Selection using EEG and Eye Tracking
Felix Putze, Johannes Popp, Jutta Hild, Jürgen Beyerer, and Tanja Schultz
(University of Bremen, Germany; KIT, Germany; Fraunhofer IOSB, Germany)
In this paper, we show how recordings of gaze movements (via eye tracking) and brain activity (via electroencephalography) can be combined to provide an interface for implicit selection in a graphical user interface. This implicit selection works completely without manual intervention by the user. In our approach, we formulate implicit selection as a classification problem, describe the employed features and classification setup and introduce our experimental setup for collecting evaluation data. With a fully online-capable setup, we can achieve an F_0.2-score of up to 0.74 for temporal localization and a spatial localization accuracy of more than 0.95.
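The F_0.2 metric used above weights precision far more heavily than recall, which suits implicit selection where a spurious trigger is costlier than a missed one. A tiny illustrative computation (toy labels, not the authors' data), in Python with scikit-learn:

from sklearn.metrics import fbeta_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # frames where a selection was actually intended (toy)
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]   # classifier output from EEG + gaze features (toy)
print(fbeta_score(y_true, y_pred, beta=0.2))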
@InProceedings{ICMI16p153,
author = {Felix Putze and Johannes Popp and Jutta Hild and Jürgen Beyerer and Tanja Schultz},
title = {Intervention-Free Selection using EEG and Eye Tracking},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {153--160},
doi = {},
year = {2016},
}
Automated Scoring of Interview Videos using Doc2Vec Multimodal Feature Extraction Paradigm
Lei Chen, Gary Feng, Chee Wee Leong, Blair Lehman, Michelle Martin-Raugh, Harrison Kell, Chong Min Lee, and Su-Youn Yoon
(Educational Testing Service, USA)
As the popularity of video-based job interviews rises, so does the need for automated tools to evaluate interview performance. Real-world hiring decisions are based on assessments of knowledge and skills as well as holistic judgments of person-job fit. While previous research on automated scoring of interview videos shows promise, it lacks coverage of monologue-style responses to structured interview (SI) questions and content-focused interview rating. We report the development of a standardized video interview protocol as well as human rating rubrics focusing on verbal content, personality, and holistic judgment. A novel feature extraction method using "visual words" automatically learned from video analysis outputs and the Doc2Vec paradigm is proposed. Our promising experimental results suggest that this novel method provides effective representations for the automated scoring of interview videos.
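A minimal sketch of the visual-word/Doc2Vec idea (not the authors' pipeline): each response is treated as a document of visual words and embedded with gensim's Doc2Vec; the word sequences, tags and hyperparameters below are hypothetical toy values.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each interview response becomes a "document" of visual words, e.g. cluster IDs
# assigned to frame-level facial/gesture analysis outputs (toy sequences below).
videos = {
    "resp_001": ["smile", "nod", "gaze_front", "smile"],
    "resp_002": ["gaze_away", "still", "brow_furrow", "still"],
}
docs = [TaggedDocument(words=w, tags=[vid]) for vid, w in videos.items()]

model = Doc2Vec(docs, vector_size=16, min_count=1, epochs=40)
embedding = model.infer_vector(videos["resp_001"])   # fixed-length response representation
# `embedding` would then feed a regressor trained against human interview ratings.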
@InProceedings{ICMI16p161,
author = {Lei Chen and Gary Feng and Chee Wee Leong and Blair Lehman and Michelle Martin-Raugh and Harrison Kell and Chong Min Lee and Su-Youn Yoon},
title = {Automated Scoring of Interview Videos using Doc2Vec Multimodal Feature Extraction Paradigm},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {161--168},
doi = {},
year = {2016},
}
Estimating Communication Skills using Dialogue Acts and Nonverbal Features in Multiple Discussion Datasets
Shogo Okada, Yoshihiko Ohtake, Yukiko I. Nakano, Yuki Hayashi, Hung-Hsuan Huang, Yutaka Takase, and Katsumi Nitta
(Tokyo Institute of Technology, Japan; Seikei University, Japan; Osaka Prefecture University, Japan; Ritsumeikan University, Japan)
This paper focuses on the computational analysis of the individual communication skills of participants in a group. The analysis was conducted using three novel aspects to tackle the problem. First, we extracted features from dialogue act labels capturing how each participant communicates with the others. Second, the communication skills of each participant were assessed by 21 external raters with experience in human resource management, to obtain reliable skill scores for each participant. Third, we used the MATRICS corpus, which includes three types of group discussion datasets, to analyze the influence of situational variability across discussion types. We developed a regression model to infer the communication skill score using multimodal features including linguistic and nonverbal features: prosodic, speaking-turn, and head-activity features. The experimental results show that the multimodal fusion model with feature selection achieved the best accuracy, with an R2 of 0.74 for communication skill. A feature analysis of the models revealed which task-dependent and task-independent features contribute to the prediction performance.
@InProceedings{ICMI16p169,
author = {Shogo Okada and Yoshihiko Ohtake and Yukiko I. Nakano and Yuki Hayashi and Hung-Hsuan Huang and Yutaka Takase and Katsumi Nitta},
title = {Estimating Communication Skills using Dialogue Acts and Nonverbal Features in Multiple Discussion Datasets},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {169--176},
doi = {},
year = {2016},
}
Multi-Sensor Modeling of Teacher Instructional Segments in Live Classrooms
Patrick J. Donnelly, Nathaniel Blanchard, Borhan Samei, Andrew M. Olney, Xiaoyi Sun, Brooke Ward, Sean Kelly, Martin Nystrand, and Sidney K. D'Mello
(University of Notre Dame, USA; University of Memphis, USA; University of Wisconsin-Madison, USA; University of Pittsburgh, USA)
We investigate multi-sensor modeling of teachers’ instructional segments (e.g., lecture, group work) from audio recordings collected in 56 classes from eight teachers across five middle schools. Our approach fuses two sensors: a unidirectional microphone for teacher audio and a pressure zone microphone for general classroom audio. We segment and analyze the audio streams with respect to discourse timing, linguistic, and paralinguistic features. We train supervised classifiers to identify the five instructional segments that collectively comprised a majority of the data, achieving teacher-independent F1 scores ranging from 0.49 to 0.60. With respect to individual segments, the individual sensor models and the fused model were on par for Question & Answer and Procedures & Directions segments. For Supervised Seatwork, Small Group Work, and Lecture segments, the classroom model outperformed both the teacher and fusion models. Across all segments, a multi-sensor approach led to an average 8% improvement over the state of the art approach that only analyzed teacher audio. We discuss implications of our findings for the emerging field of multimodal learning analytics.
@InProceedings{ICMI16p177,
author = {Patrick J. Donnelly and Nathaniel Blanchard and Borhan Samei and Andrew M. Olney and Xiaoyi Sun and Brooke Ward and Sean Kelly and Martin Nystrand and Sidney K. D'Mello},
title = {Multi-Sensor Modeling of Teacher Instructional Segments in Live Classrooms},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {177--184},
doi = {},
year = {2016},
}
Oral Session 3: Groups, Teams, and Meetings
Mon, Nov 14, 10:20 - 12:00, Miraikan Hall (Chair: Nick Campbell (Trinity College Dublin))
Meeting Extracts for Discussion Summarization Based on Multimodal Nonverbal Information
Fumio Nihei, Yukiko I. Nakano, and Yutaka Takase
(Seikei University, Japan)
Group discussions are used for various purposes, such as creating new ideas and making a group decision. It is desirable to archive the results and processes of the discussion as useful resources for the group. Therefore, a key technology would be a way to extract meaningful resources from a group discussion. To accomplish this goal, we propose classification models that select meeting extracts to be included in the discussion summary based on nonverbal behavior such as attention, head motion, prosodic features, and co-occurrence patterns of these behaviors. We create different prediction models depending on the degree of extract-worthiness, which is assessed by the agreement ratio among human judgments. Our best model achieves 0.707 in F-measure and 0.75 in recall rate, and can compress a discussion into 45% of its original duration. The proposed models reveal that nonverbal information is indispensable for selecting meeting extracts of a group discussion. One of the future directions is to implement the models as an automatic meeting summarization system.
@InProceedings{ICMI16p185,
author = {Fumio Nihei and Yukiko I. Nakano and Yutaka Takase},
title = {Meeting Extracts for Discussion Summarization Based on Multimodal Nonverbal Information},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {185--192},
doi = {},
year = {2016},
}
Getting to Know You: A Multimodal Investigation of Team Behavior and Resilience to Stress
Catherine Neubauer, Joshua Woolley, Peter Khooshabeh, and Stefan Scherer
(University of Southern California, USA; University of California at San Francisco, USA; Army Research Lab at Los Angeles, USA)
Team cohesion has been suggested to be a critical factor in emotional resilience following periods of stress. Team cohesion may depend on several factors including emotional state, communication among team members and even psychophysiological response. The present study sought to employ several multimodal techniques designed to investigate team behavior as a means of understanding resilience to stress. We recruited 40 subjects to perform a cooperative task in gender-matched, two-person teams. They were responsible for working together to meet a common goal, which was to successfully disarm a simulated bomb. This high-workload task requires successful cooperation and communication among members. We assessed several behaviors that relate to facial expression, word choice and physiological responses (i.e., heart rate variability) within this scenario. A manipulation of an "ice breaker" condition was used to induce a level of comfort or familiarity within the team prior to the task. We found that individuals in the "ice breaker" condition exhibited better resilience to subjective stress following the task. These individuals also exhibited more insight and cognitive speech, more positive facial expressions and were also able to better regulate their emotional expression during the task, compared to the control.
@InProceedings{ICMI16p193,
author = {Catherine Neubauer and Joshua Woolley and Peter Khooshabeh and Stefan Scherer},
title = {Getting to Know You: A Multimodal Investigation of Team Behavior and Resilience to Stress},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {193--200},
doi = {},
year = {2016},
}
Measuring the Impact of Multimodal Behavioural Feedback Loops on Social Interactions
Ionut Damian, Tobias Baur, and Elisabeth André
(University of Augsburg, Germany)
In this paper we explore the concept of automatic behavioural feedback loops during social interactions. Behavioural feedback loops (BFL) are rapid processes which analyse the behaviour of the user in real time and provide the user with live feedback on how to improve the quality of that behaviour. In this context, we implemented an open source software framework for designing, creating, and executing BFL on Android-powered mobile devices. To get a better understanding of the effects of BFL on face-to-face social interactions, we conducted a user study and compared four different BFL types spanning three modalities: tactile, auditory and visual. For the study, the BFL were designed to improve the users' perception of their speaking time in an effort to create more balanced group discussions. The study yielded valuable insights into the impact of BFL on conversations and how humans react to such systems.
@InProceedings{ICMI16p201,
author = {Ionut Damian and Tobias Baur and Elisabeth André},
title = {Measuring the Impact of Multimodal Behavioural Feedback Loops on Social Interactions},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {201--208},
doi = {},
year = {2016},
}
Analyzing Mouth-Opening Transition Pattern for Predicting Next Speaker in Multi-party Meetings
Ryo Ishii, Shiro Kumano, and Kazuhiro Otsuka
(NTT, Japan)
Techniques that use nonverbal behaviors to predict turn-changing situations—e.g., predicting who will speak next and when, in multi-party meetings—have been receiving a lot of attention in recent research. In this research, we explored the transition pattern of the degree of mouth opening (MOTP) towards the end of an utterance to predict the next speaker in multi-party meetings. First, we collected data on utterances and on the degree of mouth opening (closed, slightly open, and wide open) from participants in four-person meetings. The representative results of the analysis of the MOTPs showed that in turn-keeping the speaker often continues to open the mouth slightly, whereas in turn-changing the speaker starts to close the mouth from a slightly open position or continues to open the mouth wide. The next speaker often starts to open the mouth slightly from a closed position in turn-changing. On the basis of these findings, we constructed next-speaker prediction models using the MOTPs. In addition, as a multimodal fusion, we constructed models using the MOTPs and gaze information, which is known to be one of the most useful types of information for next-speaker prediction. The evaluation of the models suggests that the speaker's and listeners' MOTPs are effective for predicting the next speaker in multi-party meetings. It also suggests that the multimodal fusion of MOTP and gaze information is more useful for the prediction than either alone.
@InProceedings{ICMI16p209,
author = {Ryo Ishii and Shiro Kumano and Kazuhiro Otsuka},
title = {Analyzing Mouth-Opening Transition Pattern for Predicting Next Speaker in Multi-party Meetings},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {209--216},
doi = {},
year = {2016},
}
Oral Session 4: Personality and Emotion
Mon, Nov 14, 13:20 - 14:40, Miraikan Hall (Chair: Jill Lehman (Disney Research))
Automatic Recognition of Self-Reported and Perceived Emotion: Does Joint Modeling Help?
Biqiao Zhang, Georg Essl, and Emily Mower Provost
(University of Michigan, USA; University of Wisconsin-Milwaukee, USA)
Emotion labeling is a central component of automatic emotion recognition. Evaluators are asked to estimate the emotion label given a set of cues, produced either by themselves (self-report label) or others (perceived label). This process is complicated by the mismatch between the intentions of the producer and the interpretation of the perceiver. Traditionally, emotion recognition systems use only one of these types of labels when estimating the emotion content of data. In this paper, we explore the impact of jointly modeling both an individual's self-report and the perceived label of others. We use deep belief networks (DBN) to learn a representative feature space, and model the potentially complementary relationship between intention and perception using multi-task learning. We hypothesize that the use of DBN feature-learning and multi-task learning of self-report and perceived emotion labels will improve the performance of emotion recognition systems. We test this hypothesis on the IEMOCAP dataset, an audio-visual and motion-capture emotion corpus. We show that both DBN feature learning and multi-task learning offer complementary gains. The results demonstrate that the perceived emotion tasks see greatest performance gain for emotionally subtle utterances, while the self-report emotion tasks see greatest performance gain for emotionally clear utterances. Our results suggest that the combination of knowledge from the self-report and perceived emotion labels lead to more effective emotion recognition systems.
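A simplified, hypothetical sketch of the joint modeling idea (not the authors' implementation): a shared feature trunk feeds two heads, one per label type, trained with a summed loss. A plain MLP trunk stands in for the DBN-learned representation described in the abstract; dimensions and data are toy assumptions, in Python with PyTorch.

import torch
import torch.nn as nn

class MultiTaskEmotionNet(nn.Module):
    """Shared representation with two heads: self-reported and perceived emotion."""
    def __init__(self, n_features=100, n_classes=4, hidden=256):
        super().__init__()
        # Shared layers stand in for the DBN-learned feature space.
        self.shared = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.self_report_head = nn.Linear(hidden, n_classes)
        self.perceived_head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h = self.shared(x)
        return self.self_report_head(h), self.perceived_head(h)

model = MultiTaskEmotionNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 100)                  # a batch of audio-visual features (toy data)
y_self = torch.randint(0, 4, (32,))       # self-reported labels
y_perceived = torch.randint(0, 4, (32,))  # perceived labels

out_self, out_perceived = model(x)
loss = criterion(out_self, y_self) + criterion(out_perceived, y_perceived)
loss.backward()
optimizer.step()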
@InProceedings{ICMI16p217,
author = {Biqiao Zhang and Georg Essl and Emily Mower Provost},
title = {Automatic Recognition of Self-Reported and Perceived Emotion: Does Joint Modeling Help?},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {217--224},
doi = {},
year = {2016},
}
Personality Classification and Behaviour Interpretation: An Approach Based on Feature Categories
Sheng Fang, Catherine Achard, and Séverine Dubuisson
(UPMC, France)
This paper focuses on recognizing and understanding social dimensions (personality traits and social impressions) during small group interactions. We extract a set of audio and visual features, which are divided into three categories: intra-personal features (i.e., related to only one participant), dyadic features (i.e., related to a pair of participants) and one-vs-all features (i.e., related to one participant versus the other members of the group). First, we predict the personality traits (PT) and social impressions (SI) using these three feature categories. Then, we analyse the interplay between groups of features and the personality traits/social impressions of the interacting participants. The prediction is done using Support Vector Machines and Ridge Regression, which allows us to determine the most dominant features for each social dimension. Our experiments show that the combination of intra-personal and one-vs-all features can greatly improve the prediction accuracy of personality traits and social impressions. Prediction accuracy reaches 81.37% for the social impression named 'Rank of Dominance'. Finally, we draw some interesting conclusions about the relationship between personality traits/social impressions and social features.
@InProceedings{ICMI16p225,
author = {Sheng Fang and Catherine Achard and Séverine Dubuisson},
title = {Personality Classification and Behaviour Interpretation: An Approach Based on Feature Categories},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {225--232},
doi = {},
year = {2016},
}
Multiscale Kernel Locally Penalised Discriminant Analysis Exemplified by Emotion Recognition in Speech
Xinzhou Xu, Jun Deng, Maryna Gavryukova, Zixing Zhang, Li Zhao, and Björn Schuller
(Southeast University, China; TU Munich, Germany; University of Passau, Germany; Imperial College London, UK)
We propose a novel method to learn multiscale kernels with locally penalised discriminant analysis, namely Multiscale-Kernel Locally Penalised Discriminant Analysis (MS-KLPDA). As an exemplary use-case, we apply it to recognise emotions in speech. Specifically, we employ the term of locally penalised discriminant analysis by controlling the weights of marginal sample pairs, while the method learns kernels with multiple scales. Evaluated in a series of experiments on emotional speech corpora, our proposed MS-KLPDA is able to outperform the previous research of Multiscale-Kernel Fisher Discriminant Analysis and some conventional methods in solving speech emotion recognition.
@InProceedings{ICMI16p233,
author = {Xinzhou Xu and Jun Deng and Maryna Gavryukova and Zixing Zhang and Li Zhao and Björn Schuller},
title = {Multiscale Kernel Locally Penalised Discriminant Analysis Exemplified by Emotion Recognition in Speech},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {233--237},
doi = {},
year = {2016},
}
Estimating Self-Assessed Personality from Body Movements and Proximity in Crowded Mingling Scenarios
Laura Cabrera-Quiros, Ekin Gedik, and Hayley Hung
(Delft University of Technology, Netherlands; Instituto Tecnológico de Costa Rica, Costa Rica)
This paper focuses on the automatic classification of self-assessed personality traits from the HEXACO inventory during crowded mingle scenarios. We exploit acceleration and proximity data from a wearable device hung around the neck. Unlike most state-of-the-art studies, addressing personality estimation during mingle scenarios provides a challenging social context as people interact dynamically and freely in a face-to-face setting. While many former studies use audio to extract speech-related features, we present a novel method of extracting an individual's speaking status from a single body-worn triaxial accelerometer, which scales easily to large populations. Moreover, by fusing both speech- and movement-energy related cues from acceleration alone, our experimental results show improvements on the estimation of Humility over features extracted from a single behavioral modality. We validated our method on 71 participants, obtaining an accuracy of 69% for Honesty, Conscientiousness and Openness to Experience. To our knowledge, this is the largest validation of personality estimation carried out in such a social context with simple wearable sensors.
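A heuristic, hypothetical sketch of deriving speaking status from a single triaxial accelerometer (not the authors' method): the magnitude signal is crudely high-passed and per-window energy is thresholded. The sampling rate, window length and threshold below are illustrative assumptions only, in Python with NumPy.

import numpy as np

def speaking_status(acc, fs=20.0, win_s=1.0, threshold=0.5):
    """Binary speaking status per window from a chest-worn triaxial accelerometer.

    Heuristic sketch: speaking produces small, fast torso vibrations, so we remove
    a moving average from the magnitude signal (a crude high-pass) and threshold
    the per-window energy. The threshold value is illustrative, not calibrated.
    """
    mag = np.linalg.norm(acc, axis=1)                               # (n, 3) -> magnitude
    mag = mag - np.convolve(mag, np.ones(int(fs)) / fs, mode="same")
    win = int(win_s * fs)
    n_win = len(mag) // win
    energy = (mag[: n_win * win].reshape(n_win, win) ** 2).mean(axis=1)
    return energy > threshold

acc = np.random.default_rng(7).normal(size=(1200, 3))   # one minute at 20 Hz (toy data)
print(speaking_status(acc)[:10])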
@InProceedings{ICMI16p238,
author = {Laura Cabrera-Quiros and Ekin Gedik and Hayley Hung},
title = {Estimating Self-Assessed Personality from Body Movements and Proximity in Crowded Mingling Scenarios},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {238--242},
doi = {},
year = {2016},
}
Poster Session 2
Mon, Nov 14, 15:00 - 17:00, Conference Room 3
Deep Learning Driven Hypergraph Representation for Image-Based Emotion Recognition
Yuchi Huang and Hanqing Lu
(Chinese Academy of Sciences, China)
In this paper, we proposed a bi-stage framework for image-based emotion recognition by combining the advantages of deep convolutional neural networks (D-CNN) and hypergraphs. To exploit the representational power of D-CNN, we remodeled its last hidden feature layer as the `attribute' layer in which each hidden unit produces probabilities on a specific semantic attribute. To describe the high-order relationship among facial images, each face was assigned to various hyperedges according to the computed probabilities on different D-CNN attributes. In this way, we tackled the emotion prediction problem by a transductive learning approach, which tends to assign the same label to faces that share many incidental hyperedges (attributes), with the constraints that predicted labels of training samples should be similar to their ground truth labels. We compared the proposed approach to state-of-the-art methods and its effectiveness was demonstrated by extensive experimentation.
@InProceedings{ICMI16p243,
author = {Yuchi Huang and Hanqing Lu},
title = {Deep Learning Driven Hypergraph Representation for Image-Based Emotion Recognition},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {243--247},
doi = {},
year = {2016},
}
Towards a Listening Agent: A System Generating Audiovisual Laughs and Smiles to Show Interest
Kevin El Haddad, Hüseyin Çakmak, Emer Gilmartin, Stéphane Dupont, and Thierry Dutoit
(University of Mons, Belgium; Trinity College Dublin, Ireland)
In this work, we experiment with the use of smiling and laughter in order to help create more natural and efficient listening agents. We present preliminary results on a system which predicts smile and laughter sequences in one dialogue participant based on observations of the other participant's behavior. This system also predicts the level of intensity or arousal for these sequences. We also describe an audiovisual concatenative synthesis process used to generate laughter and smiling sequences, producing multilevel amusement expressions from a dataset of audiovisual laughs. We thus present two contributions: one in the generation of smiling and laughter responses, the other in the prediction of what laughter and smiles to use in response to an interlocutor's behaviour. Both the synthesis system and the prediction system have been evaluated via Mean Opinion Score tests and have proved to give satisfying and promising results which open the door to interesting perspectives.
@InProceedings{ICMI16p248,
author = {Kevin El Haddad and Hüseyin Çakmak and Emer Gilmartin and Stéphane Dupont and Thierry Dutoit},
title = {Towards a Listening Agent: A System Generating Audiovisual Laughs and Smiles to Show Interest},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {248--255},
doi = {},
year = {2016},
}
Sound Emblems for Affective Multimodal Output of a Robotic Tutor: A Perception Study
Helen Hastie, Pasquale Dente, Dennis Küster, and Arvid Kappas
(Heriot-Watt University, UK; Jacobs University, Germany)
Human and robot tutors alike have to give careful consideration as to how feedback is delivered to students to provide a motivating yet clear learning context. Here, we performed a perception study to investigate attitudes towards negative and positive robot feedback in terms of perceived emotional valence on the dimensions of 'Pleasantness', 'Politeness' and 'Naturalness'. We find that, indeed, negative feedback is perceived as significantly less polite and pleasant. Unlike humans, who have the capacity to leverage various paralinguistic cues to convey subtle variations of meaning and emotional climate, robots are presently much less expressive. However, they have one advantage: they can combine synthetic robotic sound emblems with verbal feedback. We investigate whether these sound emblems, and their position in the utterance, can be used to modify the perceived emotional valence of the robot feedback. We discuss this in the context of an adaptive robotic tutor interacting with students in a multimodal learning environment.
@InProceedings{ICMI16p256,
author = {Helen Hastie and Pasquale Dente and Dennis Küster and Arvid Kappas},
title = {Sound Emblems for Affective Multimodal Output of a Robotic Tutor: A Perception Study},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {256--260},
doi = {},
year = {2016},
}
Automatic Detection of Very Early Stage of Dementia through Multimodal Interaction with Computer Avatars
Hiroki Tanaka, Hiroyoshi Adachi, Norimichi Ukita, Takashi Kudo, and Satoshi Nakamura
(Nara Institute of Science and Technology, Japan; Osaka University Health Care Center, Japan)
This paper proposes a new approach to automatically detecting the very early stage of dementia. We develop a computer avatar with spoken dialog functionalities that produces natural spoken queries based on the Mini-Mental State Examination, the Wechsler Memory Scale-Revised, and other related questions. Multimodal interaction data from spoken dialogues with 18 participants (9 with dementia and 9 healthy controls) were recorded, and audiovisual features were extracted. We confirm that support vector machines can classify the two groups with a detection performance of 0.94 as measured by the area under the ROC curve. These results suggest that our system can detect the very early stage of dementia through spoken dialog with computer avatars.
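As an illustration of the evaluation pipeline described above (SVM classification scored by area under the ROC curve), here is a minimal scikit-learn sketch with leave-one-out cross-validation; the feature matrix and labels are random placeholders, not the authors' data.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import roc_auc_score

# Placeholder audiovisual features: one row per participant (18 in the paper),
# and a binary label (1 = very early stage dementia, 0 = healthy control).
rng = np.random.default_rng(0)
X = rng.normal(size=(18, 40))
y = np.array([1] * 9 + [0] * 9)

# Leave-one-out cross-validation: each participant is held out once and
# scored by an SVM trained on the remaining 17.
scores = np.zeros(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="linear", probability=True).fit(X[train_idx], y[train_idx])
    scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]

print("AUC:", roc_auc_score(y, scores))  # the paper reports 0.94 on real data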
@InProceedings{ICMI16p261,
author = {Hiroki Tanaka and Hiroyoshi Adachi and Norimichi Ukita and Takashi Kudo and Satoshi Nakamura},
title = {Automatic Detection of Very Early Stage of Dementia through Multimodal Interaction with Computer Avatars},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {261--265},
doi = {},
year = {2016},
}
MobileSSI: Asynchronous Fusion for Social Signal Interpretation in the Wild
Simon Flutura, Johannes Wagner, Florian Lingenfelser, Andreas Seiderer, and Elisabeth André
(University of Augsburg, Germany)
Over the last years, mobile devices have become an integral part of people's everyday life. At the same time, they provide more and more computational power and memory capacity to perform complex calculations that formerly could only be accomplished with bulky desktop machines. These capabilities combined with the willingness of people to permanently carry them around open up completely new perspectives to the area of Social Signal Processing. To allow for an immediate analysis and interaction, real-time assessment is necessary. To exploit the benefits of multiple sensors, fusion algorithms are required that are able to cope with data loss in asynchronous data streams. In this paper we present MobileSSI, a port of the Social Signal Interpretation (SSI) framework to Android and embedded Linux platforms. We will test to what extent it is possible to run sophisticated synchronization and fusion mechanisms in an everyday mobile setting and compare the results with similar tasks in a laboratory environment.
@InProceedings{ICMI16p266,
author = {Simon Flutura and Johannes Wagner and Florian Lingenfelser and Andreas Seiderer and Elisabeth André},
title = {MobileSSI: Asynchronous Fusion for Social Signal Interpretation in the Wild},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {266--273},
doi = {},
year = {2016},
}
Language Proficiency Assessment of English L2 Speakers Based on Joint Analysis of Prosody and Native Language
Yue Zhang, Felix Weninger, Anton Batliner, Florian Hönig, and Björn Schuller
(Imperial College London, UK; Nuance Communications, Germany; University of Passau, Germany; University of Erlangen-Nuremberg, Germany)
In this work, we present an in-depth analysis of the interdependency between the non-native prosody and the native language (L1) of English L2 speakers, as investigated separately in the Degree of Nativeness Task and the Native Language Task of the INTERSPEECH 2015 and 2016 Computational Paralinguistics ChallengE (ComParE). To this end, we propose a multi-task learning scheme based on auxiliary attributes for jointly learning the tasks of L1 classification and prosody score regression. The effectiveness of this scheme is demonstrated in extensive experimental runs, comparing various standardised feature sets of prosodic, cepstral, spectral, and voice quality descriptors, as well as automatic feature selection. As a result, we show that the prediction of both prosody score and L1 can be improved by considering both tasks in a holistic way. In particular, we achieve an 11% relative gain in regression performance (Spearman's correlation coefficient) on prosody scores when comparing the best multi- and single-task learning results.
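One generic way to set up such joint learning of L1 classification and prosody-score regression is a shared network trunk with two task-specific heads, sketched below in PyTorch; this is an illustration of the multi-task idea, not the authors' model, and the layer sizes and loss weighting are assumptions.

import torch
import torch.nn as nn

class JointProsodyL1Net(nn.Module):
    """Shared trunk with two heads: L1 classification and prosody-score
    regression, trained with a weighted sum of both task losses."""
    def __init__(self, n_features, n_l1_classes, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.l1_head = nn.Linear(hidden, n_l1_classes)   # native language
        self.prosody_head = nn.Linear(hidden, 1)         # degree of nativeness

    def forward(self, x):
        h = self.trunk(x)
        return self.l1_head(h), self.prosody_head(h).squeeze(-1)

def joint_loss(l1_logits, prosody_pred, l1_target, prosody_target, alpha=0.5):
    # Weighted combination of the classification and regression losses.
    return (alpha * nn.functional.cross_entropy(l1_logits, l1_target)
            + (1 - alpha) * nn.functional.mse_loss(prosody_pred, prosody_target))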
@InProceedings{ICMI16p274,
author = {Yue Zhang and Felix Weninger and Anton Batliner and Florian Hönig and Björn Schuller},
title = {Language Proficiency Assessment of English L2 Speakers Based on Joint Analysis of Prosody and Native Language},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {274--278},
doi = {},
year = {2016},
}
Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution
Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang
(Microsoft Research, USA)
Crowd sourcing has become a widely adopted scheme for collecting ground truth labels. However, it is a well-known problem that these labels can be very noisy. In this paper, we demonstrate how to learn a deep convolutional neural network (DCNN) from noisy labels, using facial expression recognition as an example. More specifically, each input image is labeled by 10 taggers, and we compare four different approaches to utilizing the multiple labels: majority voting, multi-label learning, probabilistic label drawing, and cross-entropy loss. We show that the traditional majority voting scheme does not perform as well as the last two approaches, which fully leverage the label distribution. An enhanced FER+ data set with multiple labels for each face image will also be shared with the research community.
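The contrast between majority voting and training on the full label distribution can be sketched in a few lines of PyTorch; the class counts below are illustrative placeholders, not FER+ data.

import torch
import torch.nn.functional as F

# Ten crowd labels per image over 8 emotion classes, stored as counts.
counts = torch.tensor([[0., 1., 7., 0., 2., 0., 0., 0.],
                       [5., 4., 0., 1., 0., 0., 0., 0.]])
logits = torch.randn(2, 8, requires_grad=True)        # dummy network outputs

# (a) Majority voting: collapse the distribution to a single hard label.
hard_targets = counts.argmax(dim=1)
loss_mv = F.cross_entropy(logits, hard_targets)

# (b) Cross-entropy against the full label distribution: keep annotator
#     disagreement as a soft target instead of discarding it.
soft_targets = counts / counts.sum(dim=1, keepdim=True)
loss_ce = -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()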
@InProceedings{ICMI16p279,
author = {Emad Barsoum and Cha Zhang and Cristian Canton Ferrer and Zhengyou Zhang},
title = {Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {279--283},
doi = {},
year = {2016},
}
Deep Multimodal Fusion for Persuasiveness Prediction
Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency
(University of Central Florida, USA; Carnegie Mellon University, USA)
Persuasiveness is a high-level personality trait that quantifies the influence a speaker has on the beliefs, attitudes, intentions, motivations, and behavior of the audience. With social multimedia becoming an important channel in propagating ideas and opinions, analyzing persuasiveness is very important. In this work, we use the publicly available Persuasive Opinion Multimedia (POM) dataset to study persuasion. One of the challenges associated with this problem is the limited amount of annotated data. To tackle this challenge, we present a deep multimodal fusion architecture which is able to leverage complementary information from individual modalities for predicting persuasiveness. Our methods show significant improvement in performance over previous approaches.
@InProceedings{ICMI16p284,
author = {Behnaz Nojavanasghari and Deepak Gopinath and Jayanth Koushik and Tadas Baltrušaitis and Louis-Philippe Morency},
title = {Deep Multimodal Fusion for Persuasiveness Prediction},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {284--288},
doi = {},
year = {2016},
}
Comparison of Three Implementations of HeadTurn: A Multimodal Interaction Technique with Gaze and Head Turns
Oleg Špakov, Poika Isokoski, Jari Kangas, Jussi Rantala, Deepak Akkil, and Roope Raisamo
(University of Tampere, Finland)
The best way to construct user interfaces for smart glasses is not yet known. We investigated the use of eye tracking in this context in two experiments. Eye and head movements were combined so that one can select an object to interact with by looking at it and then change a setting of that object by turning the head horizontally. We compared three different techniques for mapping the head turn to scrolling a list of numbers, with and without haptic feedback. We found that haptic feedback had no noticeable effect on objective metrics, but it sometimes improved user experience. Direct mapping of head orientation to list position is fast and easy to understand, but the signal-to-noise ratio of eye and head position measurement limits the usable range. The technique with a constant rate of change after crossing a head-angle threshold was simple and functional, but slow when the rate of change is adjusted to suit beginners. Finally, a rate of change dependent on the head angle tends to lead to fairly long task completion times, although in theory it offers a good combination of speed and accuracy.
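The three head-turn-to-scrolling mappings compared in the paper can be summarized by the following illustrative functions; the thresholds, gains, and rates are assumptions, not the parameter values used in the study.

def direct_mapping(head_angle_deg, degrees_per_item=2.0):
    """Technique 1: head orientation maps directly to a list offset."""
    return head_angle_deg / degrees_per_item

def constant_rate(head_angle_deg, dt, threshold_deg=10.0, items_per_s=2.0):
    """Technique 2: scroll at a fixed rate once the turn crosses a threshold."""
    if abs(head_angle_deg) < threshold_deg:
        return 0.0
    return items_per_s * dt * (1 if head_angle_deg > 0 else -1)

def angle_dependent_rate(head_angle_deg, dt, threshold_deg=10.0, gain=0.3):
    """Technique 3: scrolling rate grows with the angle beyond the threshold."""
    excess = abs(head_angle_deg) - threshold_deg
    if excess <= 0:
        return 0.0
    return gain * excess * dt * (1 if head_angle_deg > 0 else -1)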
@InProceedings{ICMI16p289,
author = {Oleg Špakov and Poika Isokoski and Jari Kangas and Jussi Rantala and Deepak Akkil and Roope Raisamo},
title = {Comparison of Three Implementations of HeadTurn: A Multimodal Interaction Technique with Gaze and Head Turns},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {289--296},
doi = {},
year = {2016},
}
Effects of Multimodal Cues on Children's Perception of Uncanniness in a Social Robot
Maike Paetzel, Christopher Peters, Ingela Nyström, and Ginevra Castellano
(Uppsala University, Sweden; KTH, Sweden)
This paper investigates the influence of multimodal incongruent gender cues on children's perception of a robot's uncanniness and gender. The back-projected robot head Furhat was equipped with a female and a male face texture and voice synthesizer, and the voice and facial cues were tested in congruent and incongruent combinations. 106 children between the ages of 8 and 13 participated in the study. Results show that multimodal incongruent cues do not trigger the feeling of uncanniness in children. These results are significant as they support other recent research showing that the perception of uncanniness cannot be triggered by a categorical ambiguity in the robot. In addition, we found that children rely on auditory cues much more strongly than on facial cues when assigning a gender to the robot if presented with incongruent cues. These findings have implications for robot design, as it seems possible to change the gender of a robot by only changing its voice, without creating a feeling of uncanniness in a child.
@InProceedings{ICMI16p297,
author = {Maike Paetzel and Christopher Peters and Ingela Nyström and Ginevra Castellano},
title = {Effects of Multimodal Cues on Children's Perception of Uncanniness in a Social Robot},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {297--301},
doi = {},
year = {2016},
}
Multimodal Feedback for Finger-Based Interaction in Mobile Augmented Reality
Wolfgang Hürst and Kevin Vriens
(Utrecht University, Netherlands; TWNKLS, Netherlands)
Mobile or handheld augmented reality uses a smartphone's live video stream and enriches it with superimposed graphics. In such scenarios, tracking one's fingers in front of the camera and interpreting these traces as gestures offers interesting perspectives for interaction. Yet, the lack of haptic feedback poses challenges that need to be overcome. We present a pilot study in which three types of feedback (audio, visual, haptic) and combinations thereof are used to support basic finger-based gestures (grab, release). A comparative study with 26 subjects shows an advantage in providing combined, multimodal feedback. In addition, it suggests high potential for haptic feedback via phone vibration, which is surprising given that the phone is held in the other, non-interacting hand.
@InProceedings{ICMI16p302,
author = {Wolfgang Hürst and Kevin Vriens},
title = {Multimodal Feedback for Finger-Based Interaction in Mobile Augmented Reality},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {302--306},
doi = {},
year = {2016},
}
Smooth Eye Movement Interaction using EOG Glasses
Murtaza Dhuliawala, Juyoung Lee, Junichi Shimizu, Andreas Bulling, Kai Kunze, Thad Starner, and Woontack Woo
(Georgia Institute of Technology, USA; KAIST, South Korea; Keio University, Japan; Max Planck Institute for Informatics, Germany)
Orbits combines a visual display and an eye motion sensor to allow a user to select between options by tracking a cursor with the eyes as the cursor travels in a circular path around each option. Using an off-the-shelf Jins MEME pair of eyeglasses, we present a pilot study that suggests the eye movement required for Orbits can be sensed using three electrodes: one in the nose bridge and one in each nose pad. For forced choice binary selection, we achieve a 2.6 bits per second (bps) input rate at 250ms per input. We also introduce Head Orbits, where the user fixates the eyes on a target and moves the head in synchrony with the orbiting target. Measuring only the relative movement of the eyes in relation to the head, this method achieves a maximum rate of 2.0 bps at 500ms per input. Finally, we combine the two techniques together with a gyro to create an interface with a maximum input rate of 5.0 bps.
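The Orbits-style selection described above matches the smooth-pursuit eye signal against each option's orbiting cursor; a minimal correlation-based sketch follows, where the correlation threshold and the averaging of the horizontal and vertical coefficients are assumptions rather than the authors' implementation.

import numpy as np

def select_orbit_target(eog_xy, target_trajectories, threshold=0.8):
    """Pick the orbiting target whose circular trajectory best correlates
    with the pursuit eye signal (a sketch of the Orbits idea).

    eog_xy: (T, 2) horizontal/vertical eye-movement signal.
    target_trajectories: list of (T, 2) on-screen cursor paths, one per option.
    """
    best, best_corr = None, threshold
    for i, traj in enumerate(target_trajectories):
        # Correlate x with x and y with y, then average the two coefficients.
        cx = np.corrcoef(eog_xy[:, 0], traj[:, 0])[0, 1]
        cy = np.corrcoef(eog_xy[:, 1], traj[:, 1])[0, 1]
        corr = (cx + cy) / 2.0
        if corr > best_corr:
            best, best_corr = i, corr
    return best  # None if no target exceeds the correlation threshold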
@InProceedings{ICMI16p307,
author = {Murtaza Dhuliawala and Juyoung Lee and Junichi Shimizu and Andreas Bulling and Kai Kunze and Thad Starner and Woontack Woo},
title = {Smooth Eye Movement Interaction using EOG Glasses},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {307--311},
doi = {},
year = {2016},
}
Active Speaker Detection with Audio-Visual Co-training
Punarjay Chakravarty, Jeroen Zegers, Tinne Tuytelaars, and Hugo Van hamme
(KU Leuven, Belgium; iMinds, Belgium)
In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision: audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
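The structure of the co-training loop can be sketched as follows; the function and model names are placeholders, and the real system operates on per-person face tracks and voice segments rather than whole recordings.

def cotrain_active_speaker(videos, audio_tracks, vad, video_model, voice_model):
    """High-level sketch of the audio-visual co-training loop (structure only;
    the classifiers, features, and data splits are placeholders)."""
    # Step 1: audio VAD provides weak labels ("this person is speaking now")
    # used to train a personalized video-based active speaker classifier.
    weak_labels = [vad(a) for a in audio_tracks]
    video_model.fit(videos, weak_labels)

    # Step 2: the trained video classifier labels additional footage, and
    # those labels supervise a per-person voice model on the audio side.
    video_labels = [video_model.predict(v) for v in videos]
    voice_model.fit(audio_tracks, video_labels)

    # Step 3: the personalized voice models are then used for active speaker
    # detection, closing the loop without any manual annotation.
    return voice_model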
@InProceedings{ICMI16p312,
author = {Punarjay Chakravarty and Jeroen Zegers and Tinne Tuytelaars and Hugo Van hamme},
title = {Active Speaker Detection with Audio-Visual Co-training},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {312--316},
doi = {},
year = {2016},
}
Detecting Emergent Leader in a Meeting Environment using Nonverbal Visual Features Only
Cigdem Beyan, Nicolò Carissimi, Francesca Capozzi, Sebastiano Vascon, Matteo Bustreo, Antonio Pierro, Cristina Becchio, and Vittorio Murino
(IIT Genoa, Italy; McGill University, Canada; University of Venice, Italy; Sapienza University of Rome, Italy; University of Turin, Italy)
In this paper, we propose an effective method for emergent leader detection in meeting environments based on nonverbal visual features. Identifying an emergent leader is an important issue for organizations; it is a well-investigated topic in social psychology, but a relatively new problem in social signal processing (SSP). The effectiveness of nonverbal features has been shown by many previous SSP studies. In general, video-based nonverbal features have been less effective than audio-based features, although their fusion generally improved overall performance. However, in the absence of audio sensors, accurate detection of social interactions is still crucial. Motivated by this, we propose novel, automatically extracted nonverbal features to identify emergent leadership. The features are based on visual focus of attention, automatically estimated from head pose. The proposed method and features were evaluated on a new dataset, introduced in this paper together with its design, collection, and annotation, and their effectiveness was compared against many state-of-the-art features and methods.
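A rough illustration of visual-focus-of-attention features derived from head pose is sketched below; the angular tolerance and the specific feature definitions are assumptions, and the paper's actual VFOA estimation and feature set are more elaborate.

import numpy as np

def vfoa_features(head_yaw, seat_angles, tolerance_deg=15.0):
    """Sketch of VFOA features from head pose for one meeting participant.

    head_yaw:    (T,) estimated head yaw of this participant, in degrees.
    seat_angles: dict mapping each other participant to the yaw angle at
                 which this participant would be facing them.
    """
    looking_at = {p: np.abs(head_yaw - a) < tolerance_deg
                  for p, a in seat_angles.items()}
    features = {}
    for person, mask in looking_at.items():
        features[f"watch_{person}"] = mask.mean()       # share of meeting time
        features[f"glances_{person}"] = int((np.diff(mask.astype(int)) == 1).sum())
    any_target = np.any(np.stack(list(looking_at.values())), axis=0)
    features["looking_at_nobody"] = 1.0 - any_target.mean()
    return features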
@InProceedings{ICMI16p317,
author = {Cigdem Beyan and Nicolò Carissimi and Francesca Capozzi and Sebastiano Vascon and Matteo Bustreo and Antonio Pierro and Cristina Becchio and Vittorio Murino},
title = {Detecting Emergent Leader in a Meeting Environment using Nonverbal Visual Features Only},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {317--324},
doi = {},
year = {2016},
}
Stressful First Impressions in Job Interviews
Ailbhe N. Finnerty, Skanda Muralidhar, Laurent Son Nguyen, Fabio Pianesi, and Daniel Gatica-Perez
(Fondazione Bruno Kessler, Italy; CIMeC, Italy; Idiap, Switzerland; EPFL, Switzerland; EIT Digital, Italy)
Stress can impact many aspects of our lives, such as the way we interact and work with others, or the first impressions that we make. In the past, stress has most commonly been assessed through self-reported questionnaires; however, advancements in wearable technology have enabled the measurement of physiological symptoms of stress in an unobtrusive manner. Using a dataset of job interviews, we investigate whether first impressions of stress (from annotations) are equivalent to physiological measurements of electrodermal activity (EDA). We examine the use of automatically extracted nonverbal cues stemming from both the visual and audio modalities, as well as EDA stress measurements, for the inference of stress impressions obtained from manual annotations. Stress impressions were found to be significantly negatively correlated with hireability ratings, i.e., individuals who were perceived to be more stressed were more likely to obtain lower hireability scores. The analysis revealed a significant relationship for the audio and visual features but low predictability, and no significant effects were found for the EDA features. While some nonverbal cues were more clearly related to stress, the physiological cues were less reliable and warrant further investigation into the use of wearable sensors for stress detection.
@InProceedings{ICMI16p325,
author = {Ailbhe N. Finnerty and Skanda Muralidhar and Laurent Son Nguyen and Fabio Pianesi and Daniel Gatica-Perez},
title = {Stressful First Impressions in Job Interviews},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {325--332},
doi = {},
year = {2016},
}
Oral Session 5: Gesture, Touch, and Haptics
Tue, Nov 15, 11:00 - 12:30, Miraikan Hall (Chair: Sharon Oviatt (Incaa Designs))
Analyzing the Articulation Features of Children's Touchscreen Gestures
Alex Shaw and Lisa Anthony
(University of Florida, USA)
Children’s touchscreen interaction patterns are generally quite different from those of adults. In particular, it has been established that children’s gestures are recognized by existing algorithms with much lower accuracy than are adults’ gestures. Previous work has qualitatively and quantitatively analyzed adults’ gestures to promote improved recognition, but this has not been done for children’s gestures in the same systematic manner. We present an analysis of gestures elicited from 24 children (age 5 to 10 years old) and 27 adults in which we calculate geometric, kinematic, and relative articulation features of the gestures. We examine the effect of user age on 22 different gesture features to better understand how children’s gesturing abilities and behaviors differ between various age groups, and from adults. We discuss the implications of our findings and how they will contribute to creating new gesture recognition algorithms tailored specifically for children.
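A few gesture articulation features of the kind analyzed in the paper (geometric, kinematic, temporal) can be computed from raw touch samples as sketched below; the study's specific 22-feature set is not reproduced here, and the feature names are illustrative.

import numpy as np

def articulation_features(points, timestamps):
    """Illustrative articulation features for one touchscreen gesture stroke.

    points:     (N, 2) array of x/y touch samples.
    timestamps: (N,) sample times in seconds.
    """
    deltas = np.diff(points, axis=0)
    seg_len = np.linalg.norm(deltas, axis=1)
    dt = np.diff(timestamps)
    speed = seg_len / np.maximum(dt, 1e-6)
    return {
        "path_length": seg_len.sum(),                            # geometric
        "bbox_diagonal": np.linalg.norm(points.max(0) - points.min(0)),
        "duration": timestamps[-1] - timestamps[0],              # temporal
        "mean_speed": speed.mean(),                              # kinematic
        "speed_variability": speed.std(),
    }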
@InProceedings{ICMI16p333,
author = {Alex Shaw and Lisa Anthony},
title = {Analyzing the Articulation Features of Children's Touchscreen Gestures},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {333--340},
doi = {},
year = {2016},
}
Reach Out and Touch Me: Effects of Four Distinct Haptic Technologies on Affective Touch in Virtual Reality
Imtiaj Ahmed, Ville Harjunen, Giulio Jacucci, Eve Hoggan, Niklas Ravaja, and Michiel M. Spapé
(Aalto University, Finland; University of Helsinki, Finland; Liverpool Hope University, UK)
Virtual reality presents an extraordinary platform for multimodal communication. Haptic technologies have been shown to provide an important contribution to this by facilitating co-presence and allowing affective communication. However, findings on these affective influences rely on studies that have used many different types of haptic technology, making it likely that some forms of tactile feedback are more efficient in communicating emotions than others. To find out whether this is true and which haptic technologies are most effective, we measured user experience during a communication scenario featuring an affective agent and interpersonal touch in virtual reality. Interpersonal touch was simulated using two types of vibrotactile actuators and two types of force feedback mechanisms. Self-reports of the subjective experience of the agent's touch and emotions were obtained. The results revealed that, regardless of the agent's expression, force feedback actuators were rated as more natural and resulted in greater emotional interdependence and a stronger sense of co-presence than vibrotactile touch.
@InProceedings{ICMI16p341,
author = {Imtiaj Ahmed and Ville Harjunen and Giulio Jacucci and Eve Hoggan and Niklas Ravaja and Michiel M. Spapé},
title = {Reach Out and Touch Me: Effects of Four Distinct Haptic Technologies on Affective Touch in Virtual Reality},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {341--348},
doi = {},
year = {2016},
}
Using Touchscreen Interaction Data to Predict Cognitive Workload
Philipp Mock, Peter Gerjets, Maike Tibus, Ulrich Trautwein, Korbinian Möller, and Wolfgang Rosenstiel
(Leibniz-Institut für Wissensmedien, Germany; University of Tübingen, Germany)
Although a great number of today’s learning applications run on devices with an interactive screen, the high-resolution interaction data which these devices provide have not been used for workload-adaptive systems yet. This paper aims at exploring the potential of using touch sensor data to predict user states in learning scenarios. For this purpose, we collected touch interaction patterns of children solving math tasks on a multi-touch device. 30 fourth-grade students from a primary school participated in the study. Based on these data, we investigate how machine learning methods can be applied to predict cognitive workload associated with tasks of varying difficulty. Our results show that interaction patterns from a touchscreen can be used to significantly improve automatic prediction of high levels of cognitive workload (average classification accuracy of 90.67% between the easiest and most difficult tasks). Analyzing an extensive set of features, we discuss which characteristics are most likely to be of high value for future implementations. We furthermore elaborate on design choices made for the used multiple choice interface and discuss critical factors that should be considered for future touch-based learning interfaces.
@InProceedings{ICMI16p349,
author = {Philipp Mock and Peter Gerjets and Maike Tibus and Ulrich Trautwein and Korbinian Möller and Wolfgang Rosenstiel},
title = {Using Touchscreen Interaction Data to Predict Cognitive Workload},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {349--356},
doi = {},
year = {2016},
}
Exploration of Virtual Environments on Tablet: Comparison between Tactile and Tangible Interaction Techniques
Adrien Arnaud, Jean-Baptiste Corrégé, Céline Clavel, Michèle Gouiffès, and Mehdi Ammi
(CNRS, France; University of Paris-Saclay, France)
This paper presents a preliminary study that investigates tangible navigation in 3D virtual environments with new tablets that provide self-contained full 3D tracking (e.g., Google Tango, Intel RealSense). Tangible navigation was compared with tactile navigation techniques used in standard 3D applications (e.g., video games, CAD, data visualization). Four conditions were compared: classic multi-touch interaction, tactile interaction with sticks, tangible interaction with a 1:2.5 scale factor, and tangible interaction with a 1:5 scale factor. The study focuses on users' subjective evaluation of the different interaction techniques in terms of usability and acceptability, and on the fidelity of their representation of the explored scene. Participants were asked to explore a virtual apartment using one of the four interaction techniques. They were then asked to draw a 2D sketch map of the explored apartment and to complete a usability and acceptability questionnaire. To go further in the investigation, the participants' activity was also recorded. First results show that exploring a virtual environment using only the device's movements is more usable and acceptable than the tactile interaction techniques, but the difference seems to become smaller as the scale factor increases.
@InProceedings{ICMI16p357,
author = {Adrien Arnaud and Jean-Baptiste Corrégé and Céline Clavel and Michèle Gouiffès and Mehdi Ammi},
title = {Exploration of Virtual Environments on Tablet: Comparison between Tactile and Tangible Interaction Techniques},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {357--361},
doi = {},
year = {2016},
}
Oral Session 6: Skill Training and Assessment
Tue, Nov 15, 14:00 - 15:10, Miraikan Hall (Chair: Catherine Pelachaud (ISIR, University of Paris6))
Understanding the Impact of Personal Feedback on Face-to-Face Interactions in the Workplace
Afra Mashhadi, Akhil Mathur, Marc Van den Broeck, Geert Vanderhulst, and Fahim Kawsar
(Nokia Bell Labs, Ireland; Nokia Bell Labs, Belgium)
Face-to-face interactions have proven to accelerate team and larger organisation success. Much past research has explored the benefits of quantifying face-to-face interactions for informed workplace management; however, to date, little attention has been paid to understanding how feedback on interaction behaviour is perceived at a personal scale. In this paper, we offer a reflection on automated feedback of personal interactions in a workplace through a longitudinal study. We designed and developed a mobile system that captured, modelled, quantified and visualised the face-to-face interactions of 47 employees for 4 months in an industrial research lab in Europe. We then conducted semi-structured interviews with 20 employees to understand their perception of and experience with the system. Our findings suggest that short-term feedback on personal face-to-face interactions was not perceived as an effective external cue to promote self-reflection, and that employees desire long-term feedback annotated with actionable attributes. Our findings provide a set of implications for the designers of future workplace technology and also open up avenues for future HCI research on promoting self-reflection among employees.
@InProceedings{ICMI16p362,
author = {Afra Mashhadi and Akhil Mathur and Marc Van den Broeck and Geert Vanderhulst and Fahim Kawsar},
title = {Understanding the Impact of Personal Feedback on Face-to-Face Interactions in the Workplace},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {362--369},
doi = {},
year = {2016},
}
Asynchronous Video Interviews vs. Face-to-Face Interviews For Communication Skill Measurement: A Systematic Study
Sowmya Rasipuram, Pooja Rao S. B., and Dinesh Babu Jayagopi
(IIIT Bangalore, India)
Communication skill is an important social variable in employment interviews. As recent trends indicate, asynchronous or interface-based video interviews are becoming increasingly popular. Automatic hiring analysis is also receiving growing interest, and automatic communication skill prediction is one such task. In this context, a research gap that our paper addresses is: "Are there any differences in the perception of communication skill, and in the accuracy of automatic prediction of classes of communicators (e.g., those below average), when we compare interface-based and face-to-face interviews?" To this end, we collected a set of 106 interview videos from graduate students in both settings, i.e., interface-based and face-to-face. We observe that the perception of participants' behavior by external naive observers differs slightly between interface-based interviews (when no person is involved) and face-to-face interviews (when an interviewer is involved). In this paper, we present an automatic system to predict the communication skill of a person in interface-based and face-to-face interviews by automatically extracting several low-level features based on the audio, visual, and lexical behavior of the participants and using machine learning algorithms such as linear regression, Support Vector Machines (SVM), and logistic regression. We also make an extensive study of the verbal behavior of the participants when the spoken response is obtained from manual transcriptions and from an Automatic Speech Recognition (ASR) tool. Our best automatic prediction results achieve an accuracy of 80% in the interface-based and 83% in the face-to-face setting.
@InProceedings{ICMI16p370,
author = {Sowmya Rasipuram and Pooja Rao S. B. and Dinesh Babu Jayagopi},
title = {Asynchronous Video Interviews vs. Face-to-Face Interviews For Communication Skill Measurement: A Systematic Study},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {370--377},
doi = {},
year = {2016},
}
Context and Cognitive State Triggered Interventions for Mobile MOOC Learning
Xiang Xiao and Jingtao Wang
(University of Pittsburgh, USA)
We present Context and Cognitive State triggered Feed-Forward (C2F2), an intelligent tutoring system and algorithm, to improve both student engagement and learning efficacy in mobile Massive Open Online Courses (MOOCs). C2F2 infers and responds to learners' boredom and disengagement events in real time via a combination of camera-based photoplethysmography (PPG) sensing and learning topic importance monitoring. It proactively reminds a learner of upcoming important content (feed-forward interventions) when disengagement is detected. C2F2 runs on unmodified smartphones and is compatible with courses offered by major MOOC providers. In a 48-participant user study, we found that C2F2 on average improved learning gains by 20.2% when compared with a baseline system without the feed-forward intervention. C2F2 was especially effective for the bottom performers and improved their learning gains by 41.6%. This study demonstrates the feasibility and potential of using the PPG signals implicitly recorded by the built-in camera of smartphones to facilitate mobile MOOC learning.
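Camera-based PPG of the sort C2F2 relies on can be illustrated with a simple pipeline that band-pass filters the mean green-channel intensity of each frame and counts pulse peaks; this is a generic sketch, not the C2F2 implementation, and the filter band and peak spacing are assumptions.

import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def heart_rate_from_frames(frames, fps=30.0):
    """Estimate heart rate from camera frames via PPG (illustrative sketch).

    frames: (T, H, W, 3) video frames captured by the phone camera.
    """
    # The mean green-channel intensity per frame carries the pulse signal.
    signal = frames[..., 1].reshape(frames.shape[0], -1).mean(axis=1)
    # Band-pass around plausible heart rates (0.7-3.5 Hz, i.e. 42-210 bpm).
    b, a = butter(3, [0.7, 3.5], btype="band", fs=fps)
    filtered = filtfilt(b, a, signal - signal.mean())
    # Count pulse peaks, enforcing a minimum spacing of ~0.3 s between beats.
    peaks, _ = find_peaks(filtered, distance=int(0.3 * fps))
    duration_s = len(signal) / fps
    return 60.0 * len(peaks) / duration_s  # beats per minute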
@InProceedings{ICMI16p378,
author = {Xiang Xiao and Jingtao Wang},
title = {Context and Cognitive State Triggered Interventions for Mobile MOOC Learning},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {378--385},
doi = {},
year = {2016},
}
Native vs. Non-native Language Fluency Implications on Multimodal Interaction for Interpersonal Skills Training
Mathieu Chollet, Helmut Prendinger, and Stefan Scherer
(University of Southern California, USA; National Institute of Informatics, Japan)
New technological developments in the field of multimodal interaction show great promise for the improvement and assessment of public speaking skills. However, it is unclear how the experience of non-native speakers interacting with such technologies differs from native speakers. In particular, non-native speakers could benefit less from training with multimodal systems compared to native speakers. Additionally, machine learning models trained for the automatic assessment of public speaking ability on data of native speakers might not be performing well for assessing the performance of non-native speakers. In this paper, we investigate two aspects related to the performance and evaluation of multimodal interaction technologies designed for the improvement and assessment of public speaking between a population of English native speakers and a population of non-native English speakers. Firstly, we compare the experiences and training outcomes of these two populations interacting with a virtual audience system designed for training public speaking ability, collecting a dataset of public speaking presentations in the process. Secondly, using this dataset, we build regression models for predicting public speaking performance on both populations and evaluate these models, both on the population they were trained on and on how they generalize to the second population.
@InProceedings{ICMI16p386,
author = {Mathieu Chollet and Helmut Prendinger and Stefan Scherer},
title = {Native vs. Non-native Language Fluency Implications on Multimodal Interaction for Interpersonal Skills Training},
booktitle = {Proc.\ ICMI},
publisher = {ACM},
pages = {386--393},
doi = {},
year = {2016},
}