Video Based Sensing in Reactive Performance Spaces

Robb Lovell
Technical University of British Columbia
robblovell@nexus.techbc.ca
A reactive performance space is a theatrical environment that enables physical actions to affect and manipulate electronic media. These spaces allow performers to improvise with media through a variety of means.

Electronic media consist of any media that can be controlled from a computer, generally divided into four categories: visuals, light, sound, and mechanical systems. Physical actions within the space consist of anything that can be sensed and interpreted by a computer: for example, video based sensing, tracking systems, sound sampling, pitch detection, or analog sensors (heat, touch, bend, acceleration, etc.).
Video based sensing is an important component of reactive spaces: it provides the computer with the means to interpret what is happening within the space. This paper presents concepts around how a computer interprets reality through video based input.
Image understanding (often called image processing) is the field in computer science that tries to give the computer the ability to understand what is happening visually through the use of cameras. It is important to realize that, to date, no computer has really understood in a general way what goes on in a video scene. To a computer, the objects contained within a video scene, and the movements of those objects, are nothing more than blobs that move. This means that telling the computer to follow a hand moving in an image is not possible as a starting point.
So how does the computer follow a hand moving in an image? This is the problem that image understanding tries to solve. Perhaps it is more correct to say to the computer: “follow the largest moving thing in the image, assume that the camera view is constrained to looking over the shoulder of a person (say a conductor) at just the area the hand can reach, and that there is a constant background”. A more sophisticated technique might be: “follow the blobs in the image that match multiple previously recorded views of a hand, within a scene lit in a certain way, with a background that is subtracted from the current scene”.
From the previous examples it can be seen that these techniques are easily fooled by removing the context in which they were created. For instance, if the camera views the conductor from the front, or if the camera is too far away, the technique might fail.
The hope of this paper
is to give performers, designers, and artists an insight into the concepts and
processes that go into making a computer understand visually based input. Many of these concepts can be applied
to other types of sensors.
Cameras see space in distorted ways. They see space in much the same way as humans do, except that no distance information is available (except through difficult computation). Because of this, camera geometry is not corrected for distortions related to size and distance from the lens, and there is no peripheral vision near the camera. Objects in a scene that are close to the camera appear big, and objects that are far from it appear small.
What this means in a performance environment is that the actions of performers are distorted by their physical relationship to the camera. Actions that cut across the camera’s view appear different from actions that move toward or away from the camera. Actions that are performed close to the camera are different from those performed at a distance from it.
There is a major difference between a camera that is mounted overhead looking down and a camera that is mounted looking in from the side. Overhead camera views foreshorten human bodies that are directly underneath the camera and lengthen those that are toward the edges of the camera’s view. An algorithm designed for an overhead camera will not have to deal much with proportion problems (assuming that all the objects it deals with are on the floor), but a side-viewing camera will have to deal with both distance issues and proportion problems.
These kinds of distortions are true of any kind of sensor. Infrared distance sensors are only sensitive to distance within a cone emanating from the front of the sensor. A bend sensor only sees data where it is bent, not where the object it is attached to is bent. It is important to realize that sensors, while sensitive to reality, do not represent reality exactly; they represent it only as shadows.
A camera does not see objects in space, nor does it distinguish between bodies or boxes or tables. This might seem obvious, but underlying the statement is a fact that is not as obvious: cameras only see light, not shape. The shape of something is only obtained (if you’re lucky) by processing the output from a camera. Because cameras only see light, it is important that performers working with camera based systems know that how light falls on their bodies determines how they are seen by the computer.
Another fact that comes out of this realization is that changes in lighting are seen by the camera as movement. If a light is turned off, for instance, then the camera will see something completely different in composition. Humans can easily take this into account because we constantly recognize the content of what we see, but computers do not have access to this kind of information (at least not with current techniques).
The type of camera and
type of lighting can make a difference in what the computer perceives, and how
much the performer or designer has to pay attention to lighting anomalies.
So what kind of
information can be extracted from a camera by a computer easily? The answer to this question is complex
and is dependent upon the environment the camera is used within, and the techniques
used to process the data streaming from the camera.
There are several
processing techniques that can get at particular kinds of information. This information is general in scope
and can be used to infer things based on the environmental setup. These general “operators”
process light to extract some property of the scene. These operators include but are not limited to: motion, presence, background,
and objects.
The motion operator does not extract speed, although it
implies it. Motion is calculated by subtracting successive images
from each other, and counting the number of pixels that have changed. Motion is light changing, but under constant lighting
conditions, motion is the change in surface area of objects in the scene. This precise definition is needed
because motion does not
unambiguously extract the speed of an object. To see why this is so, consider the motion of a hand just in front of the camera lens and the motion of a hand 10 meters from the camera. At the camera lens, the hand is big, and any movement causes many pixels in the camera’s image plane to change, so the motion detected is large. At 10 meters, the hand is very small in the image plane, and as it moves it causes only minimal changes in the image.
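As a rough illustration, here is a minimal sketch of a motion operator of this kind, assuming successive grayscale frames are available as NumPy arrays; the change threshold of 30 is an arbitrary illustrative value, not one taken from any particular system.

    import numpy as np

    def motion_amount(prev_frame, curr_frame, change_threshold=30):
        """Motion operator: subtract successive grayscale frames and count
        the pixels whose brightness changed by more than a threshold."""
        # Signed difference between the two frames (values may be negative).
        diff = curr_frame.astype(np.int16) - prev_frame.astype(np.int16)
        # A pixel counts as "moving" when it changed by more than the
        # threshold in either direction (leading or trailing edge).
        changed = np.abs(diff) > change_threshold
        return int(np.count_nonzero(changed))

Exactly as described above, a hand near the lens changes many pixels and so produces a large value, while the same hand at 10 meters changes very few.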
The presence operator detects the absence or presence of light. Under constant lighting conditions, it can imply the absence or presence of a body or any other object that reflects light. The size that the presence operator implies for the objects it sees depends on how far from the camera the objects are placed. Changes in the size of the objects are seen as motion. It is important to realize that anything with a texture (something with a pattern in it, like a checked tablecloth) will show up as many objects to the computer.
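A minimal sketch of a presence operator along these lines, assuming an 8-bit grayscale frame and an illustrative brightness threshold:

    import numpy as np

    def presence(frame, light_threshold=128):
        """Presence operator: count pixels brighter than a threshold.
        Under constant lighting, lit pixels imply that something
        reflective (a body, a prop) is in front of the camera."""
        lit = frame > light_threshold       # boolean mask of "lit" pixels
        return int(np.count_nonzero(lit))   # apparent size, in pixels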
The background operator is used to enhance the sensitivity of the presence and motion operators. It is simply an operator that tries to determine what is background and what is foreground. The simplest background technique is to grab a snapshot of the scene with nothing but the background in it. Later, this snapped scene can be subtracted from the current one to reveal the objects that are not part of the background. Other, more sophisticated techniques involve slowly accumulating the background over time, or more complicated statistical methods. By subtracting the background from an incoming camera scene, the objects in the foreground show up clearly in the image. However, if an object has the same color and intensity as the background, it will remain invisible to the computer.
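Here is a sketch of both variants mentioned above, assuming grayscale frames as NumPy arrays; the threshold and accumulation rate are illustrative values:

    import numpy as np

    def foreground_mask(frame, background, threshold=25):
        """Background operator: subtract a stored background image from
        the current frame; pixels that differ strongly are foreground."""
        diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
        return diff > threshold   # True where something differs from the background

    def update_background(background, frame, rate=0.05):
        """Slowly accumulate the background over time (a running average),
        so that gradual changes are absorbed into the background model.
        The background is kept as a floating point image."""
        return (1.0 - rate) * background + rate * frame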
The object operator tries to find objects that are distinct as single entities within the physical space. The result of this operator is a list of things that look different to the computer in some way. There are a vast number of ways that things can look different to the computer. The most common is through the division of light things from darker ones, or through the quantification of different color spectrums. Once an object list is extracted by the computer, there are many types of information that, in theory, could be extracted from it: size, speed, acceleration, and even recognition of what the object is. In practice, these parameters are difficult to obtain reliably because of something called the “correspondence problem” and, in the case of recognition, ambiguity in comparing stored models with the current scene. (The correspondence problem is the problem of matching the previous scene’s objects with the next scene’s, or of matching objects between the views of two cameras.)
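A minimal sketch of an object operator that groups connected foreground pixels into blobs, assuming the foreground mask from the background operator above and using SciPy’s connected-component labelling; the minimum blob size is an illustrative value:

    import numpy as np
    from scipy import ndimage

    def find_objects(foreground, min_pixels=50):
        """Object operator: group connected foreground pixels into blobs
        and report each blob's size and centre.  Deciding which blobs
        correspond to interesting objects is left to the designer."""
        labels, count = ndimage.label(foreground)   # connected components
        objects = []
        for i in range(1, count + 1):
            ys, xs = np.nonzero(labels == i)
            if ys.size < min_pixels:                # ignore tiny specks
                continue
            objects.append({"size": int(ys.size),
                            "centre": (float(xs.mean()), float(ys.mean()))})
        return objects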
A threshold is a
quantity that the computer uses to divide things into categories. Simple thresholds might divide a
group of numbers into two categories or bins. Thresholds are a key tool for extracting meaning from an
image.
For instance, a threshold that divides light from dark tells the computer what in a scene is lit and what is not. To see this, consider the values of pixels in a gray level image. Each pixel in an image takes on a value between 0 and 255 that directly corresponds to an intensity of light in the scene. A value of 0 is dark, and a value of 255 is light. In an environment that uses the full range of the camera’s capabilities, a threshold value of 128 will divide the light objects from the dark objects.
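In code, such a threshold is a single comparison; the image values here are made up for illustration:

    import numpy as np

    # An illustrative 8-bit grayscale image (values 0-255); in practice
    # these values would come from a camera frame.
    gray_image = np.array([[ 12, 200,  45],
                           [130,  90, 255],
                           [  0, 128,  64]], dtype=np.uint8)

    # A threshold of 128 divides the lit pixels from the dark ones.
    lit_mask = gray_image >= 128    # True where the scene is lit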
When a motion operator is applied by subtracting successive frames (producing what is called a difference image), the resulting image must have a threshold applied to determine whether something is moving or not. To see this, consider the following analysis. There are three possible scenarios in a difference image (taking the difference as the previous frame minus the current one). The first is where an object, say at brightness 193, doesn’t move; the resulting subtraction is 0 from frame to frame. The other two are found when the object moves across a dark background (say intensity 41). These two cases are the trailing and leading edges of the object: 193 - 41 = 152 represents the trailing edge of the object, and 41 - 193 = -152 represents the leading edge. Here, two thresholds must be applied to determine how much motion is happening due to the object; something around 100 and –100 might do the trick. Any pixel value greater than 100 or less than –100 corresponds to some movement.
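The same arithmetic, carried out on a tiny made-up strip of three pixels:

    import numpy as np

    # Worked example from the text: an object of brightness 193 moving over
    # a background of brightness 41, difference taken as previous minus current.
    previous = np.array([193,  41, 193], dtype=np.int16)
    current  = np.array([193, 193,  41], dtype=np.int16)
    diff = previous - current               # [0, -152, 152]

    # Two thresholds pick out both edges of the moving object: pixels whose
    # difference is above +100 or below -100 count as motion.
    moving = (diff > 100) | (diff < -100)
    print(int(np.count_nonzero(moving)))    # 2 pixels of motion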
One of the tricks of
the trade in image understanding is knowing how to set up an environment in
order to enable the computer to make assumptions about what it is seeing. Computers can extract information much
more easily within highly structured environments because the computer can
assume certain things are true about the environment, at least most of the
time.
Perhaps the two most important parameters that can be structured are the lighting conditions and the content of the background. There are no hard and fast rules, or situations that are better than others. In general, each situation where the computer must extract information is different in some way.
The best way to
understand the process of discovering a technique for extracting something from
an environment is to look at an example.
Consider a situation
where the desired information to be extracted from an environment is the
position of someone within a room.
Generally, if the room is empty, this can be accomplished in a non-precise
way by having a camera with a wide angle lens mounted overhead and viewing the
room from above. Three assumptions
are made about the environment and the objects found in the camera’s
image. First, it is assumed that
people will show up as objects that are of a different color from the
background. This is not always
true, but in general, people tend not to wear the colors of the floor. Another assumption is that the objects
that are seen are people and these people are on the floor and not flying through
the room. This is only true if
people aren’t jumping up and down or suspended from cables attached to
the ceiling. Finally, it is
assumed that the lighting will remain constant and that any changes that do
occur are the result of people moving and not lighting changes. Because people are on the floor, the position in the room of a person’s feet directly relates to a point in the image; in this way the person’s location is reliably extracted as long as no assumptions are violated.
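As a sketch of that final step under these assumptions, mapping a blob centroid in the overhead image to a floor position is a simple scaling; the image size, room size, and function name here are hypothetical:

    def image_to_floor(centroid_xy, image_size, room_size_metres):
        """Map a blob centroid in an overhead camera image to a floor
        position, assuming the camera looks straight down, covers the
        whole room, and the blob belongs to a person standing on the floor."""
        (cx, cy), (w, h) = centroid_xy, image_size
        room_w, room_h = room_size_metres
        return (cx / w * room_w, cy / h * room_h)

    # Hypothetical example: a blob centred at pixel (320, 120) in a 640x480
    # image of an 8m x 6m room maps to a floor position of (4.0, 1.5) metres.
    print(image_to_floor((320, 120), (640, 480), (8.0, 6.0)))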
But what happens if the lighting condition is violated, such as when the room contains moving projectable surfaces, as in the art piece called Trajets (http://ccii.banff.org/trajets/)? In this case the technique described above will not work: too much interference from the screens prevents people from being identified as objects distinct from the screens. This problem is solved in Trajets by changing the environment and the sensing configuration. In Trajets, the moving screens hang about 1 foot off the ground, allowing cameras to look underneath the screens without interference. The addition of rope lights around the outside edge of the piece allows the cameras to “see” in an otherwise dark environment. The cameras see people’s feet back lit by the rope lights underneath the moving screens. Once someone’s feet are seen in both cameras, it is easy to calculate their position in the room. However, this technique breaks down as more people enter the room because of difficulties matching objects from the two camera views.
In summary, the process of creating a situation where the computer can understand what is happening in an environment is a creative effort that involves several kinds of activities. In general these activities are as follows. Write down what is known about the environment and what can be assumed. Based on these assumptions, decide how the computer is going to distinguish interesting objects from uninteresting objects; this will constrain camera views and positions, and background content. Impose any environmental changes that you can in order to enhance the computer’s ability to distinguish interesting objects; this involves changing lighting, background, or costumes. Then, decide on the processing technique that will work best in the environment.
Once information is extracted from a scene, it is used by the computer to make decisions. Often this involves transforming the data from one range to another, processes that decide when actions or activities occur, or the application of a series of ongoing tests that fire rules.
The transformation
process is where the computer takes extracted environmental information and
converts this into intentions for action.
It is the part of the process where everything is represented virtually
in the computer. Environmental
information has been abstracted to a set of numbers that are representations of
the real state of the environment.
Actions that the computer will take as a result are also created and
manipulated as numbers and algorithms that are implemented by controller and
rendering processes down the line.
Because numbers are abstractions of real things, there are difficulties in matching up one abstraction with another. For instance, if a relationship between pixel values and a video sequence is desired, there is a difference between their abstracted representations that is handled in the transformation step. If pixels in images are values between 0 and 255, and DVD frame numbers are values between 34,000 and 34,600, some correlation needs to be established. This relationship could be as simple as: “if the pixel values go above 128 then play frames 34,000 to 34,600”. This rule establishes a relationship between pixel intensity and frame numbers (most likely within a particular time frame established by another rule).
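A sketch of that rule as a transformation step; the function name and the tuple handed downstream are hypothetical, standing in for whatever the controller or rendering process expects:

    def transform(pixel_value):
        """Transformation rule from the text: map an abstracted sensor value
        (a pixel intensity, 0-255) to an intention for action (a DVD frame
        range).  The mapping itself is just arithmetic and comparison on numbers."""
        if pixel_value > 128:
            return ("play", 34000, 34600)   # "play frames 34,000 to 34,600"
        return None                         # no action implied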
The goal of a vision
understanding system is to provide the computer with a means to interpret
actions that are occurring in the real world. This is a difficult task because the computer is unable to
recognize with any detail what is happening in a video image. The person creating the means for a
computer to understand part of an environment must make assumptions about the
structure and content of the environment in order to create algorithms to
extract information for the computer to use.