A case study in the use of proximal sensors for constructing multimodal datasets.
Converting the behavioral data streams afforded by proximal sensors into explorable multimodal datasets is a new challenge faced by HCI practitioners and by the behavioral research community focused on more theoretical questions. We intend to present a case study that demonstrates a particular approach to this problem by leveraging open web standards and lightweight development tools for rapid prototyping.
In our submission to the workshop, we intend to describe our development toolchain, which we’ve found to be particularly appropriate for this task, and to highlight some basic techniques for working with (viz., capturing, filtering, synchronizing, rendering, extracting, analyzing, and replicating) the varied behavioral activity streams made available by proximal sensors.
Voice, image, touch, motion, orientation, and geolocation sensors are now ubiquitous. They’re embedded in the communication devices we carry around in our pockets. Complementing this already pervasive technology is a new breed of compact, powerful, low-cost motion and depth sensors. We’ll refer to the various sensors designed to gauge user behavior within close range of the human body as proximal sensors, distinguishing them from distal sensors that are designed to sense activity from a distance.
Much of the appeal of proximal sensors lies in their potential for enabling new forms of perceptual computing and natural user interaction, especially when used in concert with one another. They can function as a coordinated set of input devices, each sensing one facet of the user’s behavior or context, which can then in turn be rendered in an appropriate form for the user to respond to as part of a tight perceptual feedback loop.
However, designing new forms of user interaction requires a lot of empirical groundwork: gathering observations, testing what works and what doesn’t. And therein lies a further appeal of proximal sensors: they can also serve as effective tools for gathering the very behavioral data that new interaction designs are built on and must be evaluated against.
But while proximal sensors enable new forms of behavioral data collection, integrating this new range of data into a unified event stream, in synchrony with conventional A/V media recordings, remains a challenge. This is a challenge faced both by HCI practitioners and by the behavioral research community focused on more theoretical questions.
We intend to present a case study that demonstrates a particular approach to this problem by leveraging open web standards and lightweight development tools for rapid prototyping. Our approach emerged in the course of a research study focused on action gestures, described below. With the arrival of proximal sensors and open web standards, we were able to piece together a suite of low-cost tools and techniques for producing explorable multimodal datasets. The tools and techniques we describe should be of use to interaction researchers seeking new ways of exploring nonverbal communication and proximal action streams generally.
In the next section we elaborate on just one of the motivating problems we encountered in our research study. We aim to address this problem in some detail in our final submission.
For anyone looking to utilize this new breed of sensors, one particular problem is integrating their output into a unified, synchronous event stream to enable simultaneous rendering and faceted playback.
Each sensor stream captures particular behavioral features of the user/subject’s activity. Enabling researchers to selectively view particular sets of features during playback (faceted playback) can be of great value during the exploratory phase of research.
For example, a researcher might use a close-range time-of-flight sensor to capture a subject’s gestural movements while an embedded video camera and mic record their image and speech. Ideally, the researcher could then view the A/V media in tight synchrony with a rendering of the captured motion data and select particular facets of the motion stream in the course of playback. It may be useful to view finger orientation relative to the hand in one instance, but then turn off finger rendering and isolate hand movements along a particular axis to better scrutinize certain gesture transition points.
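To make the synchronization half of this concrete, a minimal sketch in TypeScript follows: each motion frame carries a timestamp relative to the start of the A/V recording, and during playback the frame nearest the video element’s current time is looked up and rendered. The MotionFrame shape and the recordedFrames and renderFrame names are placeholders introduced here for illustration, not part of any sensor API.

```typescript
// Minimal sketch, assuming frames have already been timestamped against the
// start of the A/V recording. MotionFrame, recordedFrames, and renderFrame
// are illustrative placeholders, not part of any sensor API.
interface MotionFrame {
  t: number;                                   // seconds since recording start
  palmPosition: [number, number, number];      // mm, sensor coordinate space
}

declare const recordedFrames: MotionFrame[];   // captured earlier, sorted by t
declare function renderFrame(frame: MotionFrame): void;  // e.g. a canvas/WebGL view

// Find the latest frame at or before the current media time (binary search).
function frameAt(frames: MotionFrame[], mediaTime: number): MotionFrame | undefined {
  let lo = 0, hi = frames.length - 1, best = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (frames[mid].t <= mediaTime) { best = mid; lo = mid + 1; } else { hi = mid - 1; }
  }
  return best >= 0 ? frames[best] : undefined;
}

// Drive the motion rendering from the video's own clock.
const video = document.querySelector("video")!;
video.addEventListener("timeupdate", () => {
  const frame = frameAt(recordedFrames, video.currentTime);
  if (frame) renderFrame(frame);
});
```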
We’ll describe one approach to synchronization and faceted playback with media cueing standards that handle timed metadata. We’ll also look at some emerging standards for media/data integration at the time of capture.
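As a rough illustration of the timed-metadata route, the sketch below attaches motion frames to an HTML5 video as hidden metadata cues, so that the browser’s own media clock keeps data rendering and A/V playback in step. The text-track APIs used (addTextTrack, VTTCue, cuechange) are standard HTML5; as above, the frame shape and renderFrame are illustrative placeholders rather than our final implementation.

```typescript
// Sketch: carry per-frame motion data as timed metadata cues on the video itself.
// MotionFrame, recordedFrames, and renderFrame are illustrative, as above.
interface MotionFrame { t: number; [key: string]: unknown }
declare const recordedFrames: MotionFrame[];
declare function renderFrame(frame: MotionFrame): void;

const video = document.querySelector("video")!;
const track = video.addTextTrack("metadata", "motion");
track.mode = "hidden";               // fire cuechange events without rendering captions

// One cue per captured frame; the payload is just serialized JSON.
for (const frame of recordedFrames) {
  track.addCue(new VTTCue(frame.t, frame.t + 1 / 60, JSON.stringify(frame)));
}

track.addEventListener("cuechange", () => {
  const cues = track.activeCues;
  if (cues && cues.length > 0) {
    renderFrame(JSON.parse((cues[cues.length - 1] as VTTCue).text));
  }
});
```

One appeal of this route is that seeking, pausing, and variable-rate playback come for free from the media element, since cue activation follows the element’s current time.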
The set of practices we intend to describe in our submission emerged in the course of an experimental study focused on action gestures. For this study we utilized the Leap Motion Controller, a low-cost motion tracking device, to capture gestures produced by the subjects in the study. What follows is a description of the research question targeted in our study.
When people describe actions in the world, such as how to tie a tie or how to assemble a piece of furniture, they produce gestures that look to the naked eye much like the actions they are describing. This observation has led many researchers to endorse the view that gesture is a kind of “simulated action”. In our first project using the LEAP we are seeking to understand precisely how action gestures are like (and unlike) the actions they represent.
Actions in the world require the finely calibrated deployment of force to move objects fluidly and safely. Do gestures somehow represent this kind of force information, too, or do they abstract it away? To address this question we are using a classic reasoning puzzle, the Tower of Hanoi, which requires the solver to move disks of different weights onto and off of pegs. Though the physical details of the puzzle (e.g., the height of the pegs or the size of the disks) are irrelevant to the logical structure of the solution, people reliably encode such information in gesture when describing their solutions. Previous measurements of people’s Tower of Hanoi gestures have been relatively coarse-grained, however, limited by what features of gesture could be reliably extracted from video. By using the LEAP to track people’s gestures as they describe moving the disks, we hope to learn whether their movements differ systematically according to the weight of the disk they are describing.
In the course of this study we’ve developed a number of basic command-line tools for capturing, filtering, viewing and extracting gesture data. An outline of how we’re using these tools to record gesture samples and extract position and velocity data can be found here.
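To give a flavor of these tools, here is a compressed sketch (not our actual scripts) of the capture and extraction steps, written in TypeScript for Node: frames are read from the Leap service’s local WebSocket endpoint and logged one JSON object per line, and a second pass flattens the log into rows of timestamped palm position and velocity. The endpoint address and field names reflect the Leap’s JSON frame format as we use it; treat the specifics as assumptions.

```typescript
// Compressed sketch of a capture + extract pipeline (not our exact scripts).
// Endpoint and field names follow the Leap service's JSON frames; treat them as assumptions.
import * as fs from "fs";
import WebSocket from "ws";                       // npm: ws

// Capture: append one JSON frame per line whenever a hand is in view.
const out = fs.createWriteStream("session.jsonl");
const socket = new WebSocket("ws://127.0.0.1:6437/");
socket.on("message", (data) => {
  const frame = JSON.parse(data.toString());
  if (frame.hands && frame.hands.length > 0) {
    out.write(JSON.stringify(frame) + "\n");
  }
});

// Extract: flatten the log into CSV rows of timestamp, position, and velocity.
function extract(path: string): void {
  for (const line of fs.readFileSync(path, "utf8").split("\n")) {
    if (!line.trim()) continue;
    const frame = JSON.parse(line);
    for (const hand of frame.hands) {
      const [px, py, pz] = hand.palmPosition;     // mm
      const [vx, vy, vz] = hand.palmVelocity;     // mm/s
      console.log([frame.timestamp, px, py, pz, vx, vy, vz].join(","));
    }
  }
}
```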
Proximal sensors are arriving at a time of surging interest in the dynamics of human movement across the cognitive sciences. Over the last decade researchers have begun to see movement in all its richness — the precise arc the hand takes during reaching, the swaying of the hips as one weighs a decision — as a special window into the moment-to-moment dynamics of cognition (Spivey, 2007). One type of movement in particular that has captured the attention of researchers is hand gestures. Video has been the traditional tool for analyzing such gestures, but researchers are beginning to run up against the limits of this technology, formulating questions which would be extremely time-consuming to answer with video data or which cannot be answered at all.
While motion-tracking technologies have been used in research for several years, they have remained out of reach for all but the most determined and invested researchers. Aside from the obvious barrier of cost, there has been an even more daunting barrier of expertise: extracting and analyzing data from a motion-capture system requires intensive training, if not dedicated personnel. With the arrival of proximal sensors and open web standards, we have a new suite of low-cost tools and techniques for producing explorable multimodal datasets. These datasets can be designed to let researchers graphically visualize different dimensions of the recorded data and to query and annotate particular portions of it. For a gesture researcher, for example, such tools might have the “look and feel” of video, with similar playback controls, while also reducing the dimensionality of the data to suit the researcher’s needs.
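As a closing illustration of what “reducing the dimensionality” might mean in practice, the sketch below treats each facet as a named projection of the full motion frame; enabling or disabling facets changes what is rendered without touching the underlying recording. Field names here are invented for the example.

```typescript
// Sketch: a facet is a named projection of the full frame. Toggling facets
// changes what is rendered without altering the underlying recording.
// Field names are illustrative only.
interface MotionFrame {
  t: number;
  palmPosition: [number, number, number];
  fingerTips: [number, number, number][];
}

type Facet = (frame: MotionFrame) => Partial<MotionFrame>;

const facets: Record<string, Facet> = {
  palmOnly:  (f) => ({ t: f.t, palmPosition: f.palmPosition }),
  // Isolate motion along a single axis (here, the vertical y component).
  palmYAxis: (f) => ({ t: f.t, palmPosition: [0, f.palmPosition[1], 0] }),
  fingers:   (f) => ({ t: f.t, fingerTips: f.fingerTips }),
};

// Merge the enabled projections into one reduced view of the frame,
// e.g. applyFacets(frame, ["palmYAxis"]) keeps only vertical palm motion.
function applyFacets(frame: MotionFrame, enabled: string[]): Partial<MotionFrame> {
  let view: Partial<MotionFrame> = { t: frame.t };
  for (const name of enabled) {
    view = { ...view, ...facets[name](frame) };
  }
  return view;
}
```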