VISUAL SCENE AUGMENTATION
FOR ENHANCED HUMAN PERCEPTION
Daniel Hahn, Frederik Beutler and Uwe D. Hanebeck
Intelligent Sensor-Actuator-Systems Laboratory
Institute of Computer Science and Engineering
Universität Karlsruhe (TH)
Karlsruhe, Germany
Keywords:
Augmented Reality, Human-Machine-Interface.
Abstract:
In this paper we present an assistive system for hearing-impaired people that consists of a wearable microphone
array and an Augmented Reality (AR) system. This system helps the user in communication situations where
many speakers or sources of background noise are present. In order to restore the “cocktail party” effect,
multiple microphones are used to estimate the positions of individual sound sources. In order to allow the
user to interact in complex situations with many speakers, an algorithm for estimating the user’s attention
is developed. This algorithm determines which sound sources are in the user’s focus of attention. It
allows the system to discard irrelevant information and enables the user to focus on certain aspects of the
surroundings. Based on the user’s hearing impairment, the perception of the speaker in the focus of attention
can be enhanced, e.g. by amplification or by using a speech-to-text conversion.
A prototype has been built for evaluating this approach. Currently the prototype is able to locate sound
beacons in three-dimensional space, to perform a simple focus estimation, and to present floating captions
in the Augmented Reality. The prototype uses an intentionally simple user interface, in order to minimize
distractions.
1 INTRODUCTION
Hearing impairments have grave consequences on a
person’s social life. Everyday tasks and social inter-
action depend on spoken language, a form of com-
munication from which the hearing impaired are of-
ten cut off. Speechreading skills and assistive tech-
nology (such as conventional hearing aids or cochlear
implants) can remedy the problems to a degree. These
approaches, however, can only assist a person’s re-
maining cognitive capabilities. They do not aim at
restoring the complex functions of human hearing,
and thus do not work well in complex situations.
These limitations of conventional hearing aids are
best illustrated by their failure to restore the “cocktail
party” effect: the phenomenon that allows us to fo-
cus on a certain speaker and push other speakers and
noises into the background.
For this reason, hearing-impaired people face an in-
creased danger of social isolation. This is especially
true for those who became deaf later in life, and have
grown up in a world of spoken language. While these
people are able to express themselves orally, they are
often unable to understand what is said. Therefore
they may avoid social interactions, or at least situa-
tions with which their hearing aids cannot cope.
The system in this paper was born out of an idea to
overcome these limitations: a novel hearing aid which
would combine a wearable microphone array with an
augmented reality headset. By using multiple redun-
dant microphones, it will be possible to estimate the
position of the individual sound sources surrounding
the user, and to isolate their signals from the back-
ground noise. The sound information will then be vi-
sually presented to the user in the augmented reality
headset, using the mode of presentation that is best
suited for the user’s abilities.
In this paper we will present a first prototype of the
system, as well as methods for selecting the audio in-
formation and integrating it into the user’s reality.
Augmented reality systems integrate virtual objects
into the user’s real surroundings (Azuma, 1997). In
the case of the hearing aid, the sound sources are aug-
mented with virtual representations of their content.
In the case of spoken language, these will most likely
be textual representations of the speech, which are at-
tached to the respective speaker (e.g. speech bubbles).
Figure 1: Mockup of a possible user interface.
Augmented reality can be regarded as a new form
of user interface and requires new forms of user in-
teraction. There have been attempts to bring ele-
ments of traditional GUI interfaces into the AR, like
the Studierstube project (Schmalstieg et al., 2002).
Other researchers tried to use novel approaches, in-
corporating tangible objects into the interface (Tan
et al., 2001). All these attempts regard the AR sys-
tem as a tool that requires constant interaction. There
have been some attempts at information-only sys-
tems, like the emergency room prototype by Kauf-
man, Billinghurst et al. (Kaufman et al., 1997).
However, no accepted design rules for such systems
seem to have evolved.
A problem for the development of the new hearing
aid is the correct registration of the augmented world.
The virtual elements have to be perfectly aligned with
the real world in order to be convincing. This is
a problem that has not been completely solved yet
(Azuma, 1997). For the hearing aid the registration
does not need to be as perfect as in other applications,
but it must work without the help of external tracking
devices or visual markers, as used in the ARToolkit
(Billinghurst et al., 2004).
In order to work in complex situations, and to re-
store the cocktail party effect, the system needs to
know which speakers need to be represented in the
AR, and which sounds need to be suppressed. Crit-
ical to this is the concept of human attention. The
mechanisms of attention, which, in people with nor-
mal hearing, function without conscious effort, have
to be restored.
This requires that the system contains a model of
the user’s attention. Such a model will allow the sys-
tem to intelligently decide which information should
be augmented into the user’s reality, and which should
be discarded. This will enable the system to reduce
the amount of information to a level which the user
can easily understand.
The information needs to be presented through an
intuitive and non-obtrusive interface. The interface it-
self must not inhibit the user in any way or interfere
with his everyday tasks. This precludes the use of
complex and graphics-heavy interfaces. A major part
of the interface design will be the visual representa-
tion of spoken language. While different approaches
are possible, a textual representation seems to be the
most intuitive for the first prototype (Figure 1).
Some ideas for such an interface can be found in a
class of multimodal user interfaces, known as “atten-
tive” or “perceptual”, which pioneered the use of hu-
man attention in user interfaces. Examples are Verte-
gaal’s Attentive Interfaces (Vertegaal, 2002a) or Pent-
land’s Perceptual Intelligence (Pentland, 2000).
These systems attempt to monitor the attention of
their users, in order to interact with them more in-
telligently. An example is Vertegaal’s attentive cell
phone, which observes the user’s conversational part-
ner to determine whether a call should be put through
(Vertegaal et al., 2002).
The primary method of getting information about
the user’s attention is through the observation of the
gaze (Sibert and Jacob, 2000), (Vertegaal, 2002b).
Since professional gaze-tracking equipment is bulky
and expensive, many researchers attempt to build sim-
ple eye-tracking tools, using off-the-shelf hardware
like webcams. An example is Vertegaal’s “eyes” sys-
tem (Shell et al., 2003). Stiefelhagen tries to ascertain
the gaze by tracking the head pose, with surprisingly
good results (Stiefelhagen, 2002).
Attentive interfaces are usually seen as extensions
of graphical user interfaces (GUIs) (Vertegaal, 2003).
They are supposed to mediate the human-computer-
interaction (Shell et al., 2003). In that capacity their
role can be described as that of an “intelligent ob-
server” with a social awareness of communication.
These systems do not necessarily attempt to fully
model the user’s attention, as is necessary for the
hearing aid.
The structure of the paper is as follows. In Sec-
tion 2 models for human attention are described. In
Section 3 a model for the human attention is deduced.
In Section 3.1 we describe how the attentional state
is estimated. In Section 3.2 the target selection and
processing are described. The prototype of the system
is presented in Section 4. In Section 4.1 an overview
of the system architecture is given. The hardware
setup is described in Section 4.2 and the software
setup in Section 4.3. Experiments with the
prototype are shown in Section 4.4. In Section 5 the
results of the experiments are presented. Conclusions
and some details on future investigations are given in
Section 6.
2 HUMAN ATTENTION
Attention is the ability to selectively focus on certain
parts of one’s surroundings, while disregarding the
other parts. Attention has often been compared to a
spotlight, which selectively illuminates objects in a
dark room.
Human attention has been extensively studied by
cognitive psychologists, and there is a wealth of liter-
ature available on the issue (Chun and Wolfe, 2001).
There are two prevailing schools of thought within
the literature: Filter or attenuation theories, as pro-
posed by Broadbent (Broadbent, 1958) and Treisman
(Treisman and Gelade, 1980), assume that attention
works like a filter. Unneeded perceptions are either
removed or toned down, and do not enter conscious-
ness.
Resource models, on the other hand, propose that
attention is created by the distribution of limited at-
tentional resources (Cohen, 2003). Those process-
ing resources can be allocated to different percep-
tions, which allows them to be consciously perceived.
Those perceptions for which no resources are avail-
able will be discarded.
Both models can be used to explain the results of
psychological experiments (Cohen, 2003). We will
primarily use the resource model, since it makes it
easy to describe attention in computational terms.
Cognitive psychology has revealed many more
mechanisms of attention (Chun and Wolfe, 2001):
The spotlight of attention can be divided; multiple
objects can attract attention at the same time.
However, the overall performance always remains
the same.
Attention can be shifted through a conscious effort.
However, it can also be drawn by certain features of
the environment. For example, a blinking light will
immediately draw a person’s attention. This kind
of attention shift occurs automatically and requires
no conscious effort. This property of attention is
exploited in image processing algorithms which at-
tempt to imitate the visual attention, for an example
see (Backer and Mertsching, 2003).
While attention has spatial properties, it can also
work on whole objects. This indicates that objects
can be identified in a preattentive processing stage.
3 A MODEL OF ATTENTION
The attention model developed for the hearing aid as-
sumes that the user’s attention can be directed at a
number of possible targets. Each of these targets is a
distinct entity corresponding to an object in the real
world. For example, a speaker in a room would be a
possible target for the user’s attention.
Each target is attributed with a target description.
The description contains the raw sensor data from
that target, and may also contain semantic informa-
tion that can be used for estimating the user’s atten-
tional focus. A target description for a speaker may
consist of the raw audio data from this speaker and
the speaker’s position relative to the user.
The attentional state of the user is the distribution
of the user’s attention over the existing targets. The
distribution is expressed, for each target, as the prob-
ability that the target is the user’s primary focus of
attention. This model is consistent with the psycho-
logical results which indicate that attention is directed
at objects, rather than abstract features.
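As an illustration, the following minimal C++ sketch shows one way the targets and the attentional state could be represented as data structures. The type and field names are our own and are not taken from the prototype's code.

```cpp
#include <vector>

// One possible target of the user's attention, e.g. a speaker in the room.
struct Target {
    int id;                       // unique identifier of the target
    double position[3];           // position relative to the user (meters)
    std::vector<short> audio;     // raw audio samples extracted for this source
    bool isSpeaking;              // optional semantic information
};

// The attentional state: one focus probability per target, summing to one.
struct AttentionalState {
    std::vector<Target> targets;
    std::vector<double> focusProbability;  // focusProbability[i] belongs to targets[i]
};
```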
For estimating the attentional state, the possible tar-
gets have to be detected in the sensor information,
and each target’s sensor data is extracted separately.
This may seem like an excessive burden on the pre-
processing stage. However, in the case of the hearing
aid, advanced audio processing has to be an integral
part of the system anyway. All sound sources will
have to be identified and localized, and there has to
be a possibility to enhance each sound source sepa-
rately or feed it to a speech recognition system.
3.1 Estimating the attentional state
The model for the user’s attention consists of a num-
ber of rules. By assigning a probability to each target,
the algorithm creates an estimate of the user’s atten-
tional state. Since the rules are interchangeable, dif-
ferent approaches may be evaluated. This is necessary
since psychological experiments suggest a wealth of
approaches, but it is often unclear how they will be-
have in real-life systems.
There are two basic approaches to determine the
user’s attentional state. One is by predicting the at-
tention based on the user’s current perceptions. The
other is to monitor the user’s behavior in order to find
out where the attention is directed.
Figure 2 shows a coarse overview of the mech-
anisms of the algorithm. We assume that the user
receives perceptions or stimuli from the world and,
depending on his current attentional state, reacts to
those stimuli. The stimuli are recorded by sensors and
transformed into target descriptions in a preprocess-
ing stage. Based on the target descriptions and the
user model the user’s most likely attentional state is
estimated.
Simultaneously, the user’s reactions are monitored.
Through the reactions, the system may observe the
user’s attentional state. Any differences between the
estimated and observed state are fed back into the
model.
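One simple way to combine the two sources of information is to blend the predicted and the observed focus distributions, as in the sketch below; feeding the resulting differences back into the model could then adapt the rule weights over time. The fixed blending weight is our own assumption, not part of the model described above.

```cpp
#include <cstddef>
#include <vector>

// Fuse the predicted and the observed focus distributions into one estimate.
// Both inputs are probability vectors over the same set of targets.
std::vector<double> fuseAttentionalState(const std::vector<double>& predicted,
                                         const std::vector<double>& observed,
                                         double observationWeight = 0.7)
{
    std::vector<double> fused(predicted.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < fused.size(); ++i) {
        fused[i] = observationWeight * observed[i]
                 + (1.0 - observationWeight) * predicted[i];
        sum += fused[i];
    }
    for (double& p : fused)   // renormalize to a probability distribution
        p = (sum > 0.0) ? p / sum : 1.0 / fused.size();
    return fused;
}
```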
Figure 3: Components of the Ve system.
Figure 2: Algorithm for estimating the attentional state.
Observing attention If a person shifts his or her
attention, this shift will often result in a behavior
that can be registered by the system. This approach
will be especially useful for monitoring the user’s
conscious, extrinsic attention: since the extrinsic at-
tention is guided by the user’s will, it cannot be ex-
pressed through fixed rules.
The best-known method of observing the user’s atten-
tion is tracking the user’s gaze. Since the classi-
cal experiments of Yarbus it is known that there is a
close connection between the eye movements and a
person’s focus of attention (Yarbus, 1967). This con-
nection has been exploited many times, especially for
user interface designs. Vertegaal calls eye-based in-
teraction an ‘almost magical window into the mind of
the user’ (Vertegaal, 2002b).
For the hearing aid we simply use the head posture,
which can be easily obtained. Stiefelhagen has shown
that head posture and gaze are well correlated, and
has used this fact with great success in a conferencing
system (Stiefelhagen, 2002).
Apart from gaze tracking, there are very few meth-
ods that can be used to observe the user’s attention.
While gestures and body posture may be indicators of
attention, they are much more difficult to evaluate and
not nearly as precise as gaze tracking.
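As an illustration, a head-posture-based observation could score each target by the angle between the head direction and the direction to the target, as in the following sketch. The cosine-based scoring is our own assumption and not the method used by Stiefelhagen.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

static double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static double length(const Vec3& a) { return std::sqrt(dot(a, a)); }

// Score each target by how closely the head direction points at it and
// normalize the scores into a probability-like distribution.
std::vector<double> observeAttention(const Vec3& headDirection,
                                     const std::vector<Vec3>& directionsToTargets)
{
    std::vector<double> scores;
    double sum = 0.0;
    for (const Vec3& d : directionsToTargets) {
        double c = dot(headDirection, d) / (length(headDirection) * length(d) + 1e-9);
        double s = (c > 0.0) ? c : 0.0;   // targets behind the user get no score
        scores.push_back(s);
        sum += s;
    }
    for (double& s : scores)
        s = (sum > 0.0) ? s / sum : 1.0 / scores.size();
    return scores;
}
```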
Predicting attention Based on the forward user
model, which consists of a set of rules, the user’s
attentional state given the current perceptions is pre-
dicted. Any assumption about how a certain percep-
tion changes the user’s attention may be used as a rule
in the model.
The model functions like the user’s unconscious or
intrinsic attention; it gives an estimate of the user’s at-
tentional state, based on the probability that each
object draws the user’s attention. While providing a struc-
tured framework, it is open in the sense that differ-
ent approaches and rules may be evaluated within the
same framework.
Examples of attention-predicting systems are
the image processing techniques that attempt to locate
a probable focus of attention within an image (Backer
and Mertsching, 2003), (Draper and Lionelle, 2003).
Such an approach may be part of our user model, but
the user model is not limited to a single approach.
For example, the system may decide that a person
speaking is much more likely to be the focus of atten-
tion than a person not speaking (Stiefelhagen, 2002)
and adjust the score accordingly. Other indicators that
may be used are the object’s distance from the user or
the volume of an audio source. The system may even
scan the audio streams for trigger words, or make as-
sumptions about the current social situation, in order
to better estimate the attentional state.
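The following sketch shows how such rules could be combined into a prediction. The individual weights are illustrative assumptions, not values used in the prototype.

```cpp
#include <vector>

struct TargetDescription {
    bool   isSpeaking;   // voice activity detected for this source
    double distance;     // distance from the user in meters
    double volume;       // normalized signal energy in [0, 1]
};

// Predict a focus probability for each target from simple, interchangeable rules.
std::vector<double> predictAttention(const std::vector<TargetDescription>& targets)
{
    std::vector<double> scores;
    double sum = 0.0;
    for (const TargetDescription& t : targets) {
        double score = 0.1;                    // base score for every target
        if (t.isSpeaking) score += 1.0;        // a speaking person is a likely focus
        score += 0.5 * t.volume;               // louder sources attract more attention
        score += 1.0 / (1.0 + t.distance);     // nearby sources attract more attention
        scores.push_back(score);
        sum += score;
    }
    for (double& s : scores)                   // normalize to probabilities
        s /= sum;
    return scores;
}
```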
The current prototype does not yet contain a so-
phisticated user model, but relies on the user’s head
posture to observe the current attentional state. A
more complex user model will be included in future
versions, and we assume that the complexity of the
model will be dictated by the demands of the system
and the processing power available.
3.2 Target selection and processing
Once the attentional state is known, the system may
select a number of targets for presentation. The num-
ber of targets chosen will depend on the actual sit-
uation; there may be circumstances where it is nec-
essary to have more than a single focus of attention.
However, if too many targets are augmented, the user
easily becomes confused.
In order to present a target, the target description in-
cluding the raw sensor data will be transformed into a
format that is intelligible to the user. The format de-
pends on the user. For users who are hard of hearing,
the separated audio signal is amplified and emitted
through earphones. For a deaf user, on the other hand,
a speech-to-text conversion through a speech recogni-
tion system can be used. Since the target description
will already contain the separated audio signal from
a single source, reliable speech recognition should be
feasible.
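A sketch of how the selection and this per-user presentation choice could be combined is given below. The threshold, the maximum number of targets, and the two presentation modes are illustrative choices, not taken from the prototype.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

enum class Presentation { Amplify, SpeechToText };

// Select the most likely focus targets and decide how each one is presented.
std::vector<std::pair<int, Presentation>>
selectTargets(std::vector<std::pair<int, double>> focusProbabilities,  // (targetId, probability)
              bool userIsDeaf,
              std::size_t maxTargets = 2,
              double threshold = 0.2)
{
    // Sort targets by descending focus probability.
    std::sort(focusProbabilities.begin(), focusProbabilities.end(),
              [](const std::pair<int, double>& a, const std::pair<int, double>& b) {
                  return a.second > b.second;
              });

    std::vector<std::pair<int, Presentation>> selection;
    for (const std::pair<int, double>& t : focusProbabilities) {
        if (selection.size() >= maxTargets || t.second < threshold)
            break;
        // Deaf users get a speech-to-text caption, hard-of-hearing users amplification.
        selection.push_back({t.first, userIsDeaf ? Presentation::SpeechToText
                                                 : Presentation::Amplify});
    }
    return selection;
}
```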
Other transformations are also possible: Audible
signals (e.g. a ringing phone) could be transformed to
pictograms or less important speakers could be rep-
resented through symbols. It is even imaginable that
the system transforms the speech to a sign language
representation.
4 VE PROTOTYPE
A prototype of the hearing aid was built to evaluate
the claims made in this paper, and to improve the
methods for estimating the attention. At the core of
the prototype is a custom-built AR system, which was
developed at the Intelligent Sensor-Actuator-Systems
laboratory.
4.1 System architecture
The AR system is built upon the C++ class library
Ve¹. The library offers generic methods for video ac-
cess, stereoscopic camera calibration and methods to
augment the video streams.
A general overview of the system’s architecture is
shown in Figure 3. At the core of the system is the Ve
software. It captures the video feeds from the cameras
and creates the virtual objects from the information
received by the tracking subsystem. The video feeds
and virtual objects are combined and presented to the
user in a head mounted display (HMD).
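The per-frame interaction of these components can be sketched as follows. The class and method names are hypothetical, since the actual Ve interfaces are not listed in this paper.

```cpp
// Hypothetical interfaces sketching how the components in Figure 3 interact.
struct Frame { /* decoded camera image */ };

class VideoSource {
public:
    virtual ~VideoSource() = default;
    virtual Frame grab() = 0;                  // capture one frame from a camera
};

class TrackingStub {
public:
    virtual ~TrackingStub() = default;
    virtual void dispatchPendingEvents() = 0;  // turn CORBA position updates into Ve events
};

class Renderer {
public:
    virtual ~Renderer() = default;
    virtual void drawBackground(const Frame& left, const Frame& right) = 0;  // camera images as textures
    virtual void drawAugmentation() = 0;       // virtual objects from the tracking data
    virtual void swapBuffers() = 0;            // present the combined image in the HMD
};

// One iteration of the main loop.
void runFrame(VideoSource& leftCam, VideoSource& rightCam,
              TrackingStub& tracking, Renderer& renderer)
{
    Frame left = leftCam.grab();
    Frame right = rightCam.grab();
    tracking.dispatchPendingEvents();
    renderer.drawBackground(left, right);
    renderer.drawAugmentation();
    renderer.swapBuffers();
}
```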
4.2 Hardware setup
For the prototype, we built a video see-through AR
system, using a commercially available HMD unit
(Figure 4). Video see-through units capture the real
world through cameras; the camera images are then
fed into an opaque HMD. Compared to optical sys-
tems, where the user directly sees the surroundings
through semi-transparent glasses, video systems offer
a higher degree of flexibility: Every aspect of the dis-
play can be customized as needed.
The prototype’s stereoscopic camera head is
equipped with two high-resolution cameras by Sili-
con Imaging. They are connected to the controlling
PC (Pentium 4, 2.8 GHz, Windows XP) by standard
CAMLink frame grabber cards. The PC also con-
tains an nVidia Quadro graphics card. The adapter
has been chosen for its ability to provide two sep-
arate digital video feeds to the HMD and because
of the availability of high-quality optimized OpenGL
drivers.
A high video framerate is necessary in order to cre-
ate a realistic experience for the user. The system
currently provides a feed with 30 frames per sec-
ond, with a resolution of 640×480 pixels, for each
eye. While some performance gains may be achieved
through software optimizations, the limiting factor is
currently the maximum throughput of the PCI bus.
For the prototype, the full resolution of 1024×768
pixels may be achieved by using more advanced tech-
nology; specialized hardware is most likely necessary
for a final version of the hearing aid.
¹Ve was the name of a Norse god who gave humanity
speech and their external senses.
Figure 4: Experimental AR System with audio tracking.
The HMD is a commercially available high-
resolution system. The frame has been customized
to provide mounting points for the camera head and
the microphones of the tracking system. The headset
is connected to a ceiling-mounted control unit. The
setup allows the user to freely move about 3 meters
in each direction. Through the headset, the user sees
an augmented version of the surroundings, created by
the Ve software.
Four microphones have been attached to the head-
set as part of a basic audio tracking system; they are
connected to a DSP board which is mounted together
with the HMD’s control box. The tracking system
emulates some of the functionality of the planned
microphone array. It is able to locate loudspeakers
which emit a known signal; the position of the
speakers is used for augmentation.
4.3 Software setup
The Ve software comprises three major components:
the video capture subsystem, an event handling and
control mechanism that connects to the tracking sys-
tem, and a video rendering subsystem, which creates
the augmented reality from the video stream and the
tracking data.
The video capture subsystem provides a generic in-
terface for accessing video sources. A Ve video mod-
ule captures the video, using the hardware-specific
APIs and protocols. The video stream is also decoded
if necessary, and the API then provides pointers to
the individual pictures in memory. The video module
also sets up the hardware and provides an API to con-
trol the capture hardware (e.g. to set a different expo-
sure on the camera). Currently only a video module
for EPIX-based frame grabber cards exists, but fur-
ther modules should be easy to implement. Each cap-
ture module runs in a separate thread, concurrently to
other tasks.
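A hypothetical shape of such a video module interface is sketched below; the actual Ve interface may differ, and the names are our own.

```cpp
#include <cstdint>

// Hypothetical interface of a Ve video capture module.
class VideoModule {
public:
    virtual ~VideoModule() = default;

    virtual bool open() = 0;                              // set up the capture hardware
    virtual void close() = 0;

    // Pointer to the most recent decoded frame in memory.
    virtual const std::uint8_t* currentFrame() const = 0;
    virtual int width() const = 0;
    virtual int height() const = 0;

    // Control of the capture hardware, e.g. the exposure time.
    virtual void setExposure(int microseconds) = 0;

    // Runs in a separate thread, concurrently to the other tasks.
    virtual void captureLoop() = 0;
};

// A module for EPIX-based frame grabber cards would derive from this
// interface and wrap the hardware-specific API calls.
```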
The video output is rendered using the OpenGL
API. Camera pictures are displayed as textures,
which works well with an appropriately optimized
driver. Ve also requires the OpenGL implementation
to support the “Imaging Subset”, which contains sev-
eral image manipulation functions.
The OpenGL API was also chosen for its cross-
platform availability. Ve has been compiled on Linux
and MS Windows systems. The software is currently
only used on Windows, however, since the Linux
drivers are not sufficiently optimized.
Ve’s video output is rendered in multiple layers to
allow different modules to add to the augmentation.
Each layer may contain its own state information,
and the Ve library offers some utility functions, such
as stereoscopic calculations, for the layer modules.
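The layer mechanism could be sketched as follows; the class names are illustrative only and not part of the Ve API.

```cpp
#include <memory>
#include <vector>

// A render layer contributes one part of the augmentation and may keep its
// own state between frames.
class RenderLayer {
public:
    virtual ~RenderLayer() = default;
    virtual void render(int eye) = 0;   // 0 = left eye, 1 = right eye
};

// The renderer draws all registered layers in order, on top of the camera
// background, so that independent modules can add to the augmentation.
class LayeredRenderer {
public:
    void addLayer(std::unique_ptr<RenderLayer> layer) {
        layers_.push_back(std::move(layer));
    }
    void renderAll(int eye) {
        for (auto& layer : layers_)
            layer->render(eye);
    }
private:
    std::vector<std::unique_ptr<RenderLayer>> layers_;
};
```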
Special modules are provided for camera calibra-
tion and AR registration. The calibration module pro-
vides an interface to the camera calibration methods
of the OpenCV toolkit. The module will also compute
the stereoscopic parameters of the camera setup from
the OpenCV data. The registration module deter-
mines a transformation between the coordinate sys-
tems of the real and the virtual world, by solving the
underlying linear equation system. Some compensa-
tion of the non-linear distortions is also possible. Both
registration and calibration can be done interactively
from a simple HMD interface.
Ve contains a simple event handling mechanism.
Ve modules may subscribe to events created by other
modules, and thus react to changes in the environ-
ment. Currently the event handling is used for control
information and position updates.
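The event mechanism can be thought of as a small publish/subscribe bus, sketched below. The real Ve implementation is not shown in this paper, so the names and fields are assumptions.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

struct Event {
    std::string type;     // e.g. "position-update"
    int targetId;
    double position[3];
};

class EventBus {
public:
    using Handler = std::function<void(const Event&)>;

    // A module subscribes to all events of a given type.
    void subscribe(const std::string& type, Handler handler) {
        handlers_[type].push_back(std::move(handler));
    }

    // A module publishes an event; all subscribers are notified and may
    // react to the change in the environment.
    void publish(const Event& event) {
        for (const Handler& handler : handlers_[event.type])
            handler(event);
    }

private:
    std::map<std::string, std::vector<Handler>> handlers_;
};
```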
The tracking system is not part of Ve. Tracking
information is provided to Ve through CORBA func-
tion calls. The Ve part of the tracking subsystem
is only a stub which transforms the position updates
into Ve events and notifies AR modules of the update.
An augmentation module may then react to the up-
date and add a virtual object to the user’s view. For
testing purposes the tracking mechanism of the AR-
Toolkit can also be used as a tracking module.
4.4 Experimental setup
The prototype is set up for an interactive simulation
of the intended AR interface. The AR system is con-
nected to the audio tracking system, which monitors
the position of two loudspeakers in the room. Each of
those loudspeakers represents a human speaker; pre-
defined “speech messages” are placed near them in
the augmented reality.
The “speech messages” are rendered according to
the attention model presented earlier in this paper:
Speakers near the center of the screen are augmented
by large text messages, while speakers at the periph-
ery of the visual field only get small messages or are
not augmented at all. The speaker that is currently
within the focus of attention can also be augmented
by a translucent “focus marker” or crosshairs.
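One simple way to realize this behavior is to scale each caption according to the target's angular offset from the center of the view, as in the sketch below; the thresholds and scale values are illustrative, not the values used in the prototype.

```cpp
#include <cmath>

// Map a target's angular offset from the view direction to a caption scale.
double captionScale(double angularOffsetDegrees)
{
    double a = std::fabs(angularOffsetDegrees);
    if (a < 15.0) return 1.0;   // near the center: full-size text message
    if (a < 40.0) return 0.4;   // periphery of the visual field: small message
    return 0.0;                 // far outside the view: no augmentation
}
```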
In the prototype, the coordinates of the speakers
are manually registered to the virtual world. The
stereoscopic parameters from the camera calibration
are used to create the stereoscopic images of the text
object.
The prototype was evaluated by about 10 persons,
both male and female, aged 20 to 40 years.
The subjects were to explore the environment by turn-
ing their heads and moving about, while the loud-
speakers could be moved by the experimenter. The
subjects typically used the system for about 10 to 20
minutes.
5 RESULTS
Virtually all subjects were satisfied with the impres-
sion of the virtual world. The video see-through was
described as sufficiently realistic, even though the res-
olution had been reduced to 640×480 pixels. The
small difference between the natural eye position and
the camera position was not noticed after a short ac-
climatization period.
The subjects described the stereoscopic represen-
tation of the virtual objects as good, with the focus
markers “hovering” in front of the loudspeakers. Due
to the three dimensional view, the focus could be
clearly marked in all three dimensions.
Use of the AR system was intuitive; the subjects
were quickly able to select and activate targets at will. The
main drawback of the system during these initial tests
was the weight of the head assembly, a problem which
can be fixed by using more specialized hardware in
future prototypes.
6 CONCLUSIONS
In this paper we presented the concept and prototype
of a novel kind of hearing aid. The concept is based
on the vision of a system incorporating advanced sen-
sor technology, an augmented reality interface, and
intelligent signal processing.
The system is aware of the user’s attention, which
allows it to customize the interface to the user’s needs.
The attention-driven interface also allows the system
to address problems present in contemporary hearing
aids, such as the reestablishment of the “cocktail
party” effect.
We introduced a model of human attention, based
on the findings of cognitive psychology. The model
attempts to emulate attention and provides mecha-
nisms to predict the user’s behavior, as well as the
possibility to correct the predictions by monitoring
the user’s behavior. The model aims at replacing lost
attentive capacities, rather than at observing the user’s
attention externally. The system model is highly mod-
ular and can be extended for future prototypes.
A prototype has been built to evaluate the concepts.
To this end, a custom AR system has been imple-
mented with the modular class library Ve. It allows
for quick changes of the user interface and the evalu-
ation of multiple approaches.
First test runs showed the viability of the approach.
Navigation within the augmented reality appeared in-
tuitive and the users were easily able to direct their
attention.
Future prototypes will include refined attention
prediction, and it is assumed that the system will
evolve into a small, wearable, and easy-to-use device.
REFERENCES
Azuma, R. T. (1997). A Survey of Augmented Reality.
Presence: Teleoperators and Virtual Environments,
6(4):355–385.
Backer, G. and Mertsching, B. (2003). Two Selection
Stages Provide Efficient Object-Based Attentional
Control for Dynamic Vision. In International Work-
shop on Attention and Performance in Computer Vi-
sion, pages 9 – 16, Graz, Austria.
Billinghurst, M. et al. (2004). ARToolkit Augmented Real-
ity Toolkit. http://www.hitl.washington.edu/research/
shared_space/download/.
Broadbent, D. (1958). Perception and Communication.
Pergamon Press, London.
Chun, M. M. and Wolfe, J. M. (2001). Visual Attention. In
Blackwell’s Handbook of Perception, chapter 9, pages
272–310. Blackwell.
Cohen, A. (2003). Selective Attention. In Encyclopedia of
Cognitive Science. Nature Publishing Group (Macmil-
lan).
Draper, B. A. and Lionelle, A. (2003). Evaluation of Selec-
tive Attention under Similarity Transforms. In Inter-
national Workshop on Attention and Performance in
Computer Vision, pages 31–38, Graz, Austria.
Kaufman, N., Poupyrev, I., Miller, E., Billinghurst, M., Op-
penheimer, P., and Weghorst, S. (1997). New Inter-
face Metaphors for Complex Information Space Vi-
sualization: an ECG Monitor Object Prototype. In
Proceedings of Medicine Meets Virtual Reality, pages
131–140.
Pentland, A. (2000). Perceptual User Interfaces: Perceptual
Intelligence. Communications of the ACM, 43(3):35–
44.
Schmalstieg, D., Fuhrmann, A., Hesina, G., Szalavari, Z.,
Encarnacao, L. M., Gervautz, M., and Purgathofer, W.
(2002). The Studierstube Augmented Reality Project.
Technical Report TR-188-2-2002-05, Interactive Me-
dia Systems Group, Institute for Software Technology
and Interactive Systems, Vienna University of Tech-
nology.
Shell, J. S., Selker, T., and Vertegaal, R. (2003). Interacting
with Groups of Computers. Communications of the
ACM, 46(3):40–46.
Sibert, L. E. and Jacob, R. J. K. (2000). Evaluation of
Eye Gaze Interaction. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems,
pages 281–288, The Hague, The Netherlands. ACM
Press.
Stiefelhagen, R. (2002). Tracking Focus of Attention in
Meetings. In International Conference on Multimodal
Interfaces, page 273, Washington, DC, USA.
Tan, D. S., Poupyrev, I., Billinghurst, M., Kato, H., Re-
genbrecht, H., and Tetsutani, N. (2001). On-demand,
In-place Help for Augmented Reality Environments.
In Ubicomp 2001, Atlanta, GA, USA.
Treisman, A. and Gelade, G. (1980). A Feature-Integration
Theory of Attention. Cognitive Psychology, (12):97–
137.
Vertegaal, R. (2002a). Designing Attentive Interfaces. In
Proceedings of the Symposium on Eye Tracking Re-
search & Applications, pages 23–30, New Orleans,
La, USA. ACM Press.
Vertegaal, R. (2002b). What do the eyes behold for human-
computer interaction? In Proceedings of the sympo-
sium on Eye tracking research & applications, pages
59–60. ACM Press.
Vertegaal, R. (2003). Introduction. Commun. ACM,
46(3):30–33.
Vertegaal, R., Dickie, C., Sohn, C., and Flickner, M. (2002).
Designing Attentive Cell Phone using Wearable Eye-
contact Sensors. In CHI ’02 Extended Abstracts on
Human Factors in Computing Systems, pages 646–
647. ACM Press.
Yarbus, A. L. (1967). Eye Movements During Perception
of Complex Objects. In Eye Movements and Vision,
pages 171–196. Plenum Press, New York, NY, USA.