Patent 2437764 Summary

Third-party information liability

Some of the information on this Web page has been provided by external sources. The Government of Canada is not responsible for the accuracy, reliability or currency of the information supplied by external sources. Users wishing to rely upon this information should consult directly with the source of the information. Content provided by external sources is not subject to official languages, privacy and accessibility requirements.

Claims and Abstract availability

Any discrepancies in the text and image of the Claims and Abstract are due to differing posting times. Text of the Claims and Abstract are posted:

  • At the time the application is open to public inspection;
  • At the time of issue of the patent (grant).
(12) Patent: (11) CA 2437764
(54) English Title: AUDIO CHANNEL TRANSLATION
(54) French Title: MODULATION DE CANAL AUDIO
Status: Expired
Bibliographic Data
(51) International Patent Classification (IPC):
  • H04S 3/00 (2006.01)
(72) Inventors :
  • DAVIS, MARK FRANKLIN (United States of America)
(73) Owners :
  • DOLBY LABORATORIES LICENSING CORPORATION (United States of America)
(71) Applicants :
  • DOLBY LABORATORIES LICENSING CORPORATION (United States of America)
(74) Agent: SMART & BIGGAR LLP
(74) Associate agent:
(45) Issued: 2012-04-10
(86) PCT Filing Date: 2002-02-07
(87) Open to Public Inspection: 2002-08-15
Examination requested: 2007-02-07
Availability of licence: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): Yes
(86) PCT Filing Number: PCT/US2002/003619
(87) International Publication Number: WO2002/063925
(85) National Entry: 2003-08-05

(30) Application Priority Data:
Application No. Country/Territory Date
60/267,284 United States of America 2001-02-07

Abstracts

English Abstract




A process for translating M audio input channels representing a soundfield to
N audio output channels representing the same soundfield, wherein each channel
is a single audio stream representing audio arriving from a direction, M and N
are positive whole integers, and M is at least 2, generates one or more sets
of output channels, each set having one or more output channels. Each set is
associated with two or more spatially adjacent input channels and each output
channel in a set is generated by a process that includes determining a measure
of the correlation of the two or more input channels and the level
interrelationships of the two or more input channels.


French Abstract

L'invention porte sur un procédé de modulation de canaux d'entrée audio M représentant un champ acoustique pour les convertir en canaux de sortie audio N représentant le même champ acoustique, chaque canal étant un flot de données audio simple représentant un son provenant d'une direction, M et N étant des entiers positifs et M étant au moins égal à 2. Chaque canal a pour mission de générer un ou plusieurs ensemble(s) de canaux de sortie, chaque ensemble comportant un ou plusieurs canaux de sortie et étant associé à au moins deux canaux d'entrée adjacents dans l'espace. Chaque canal de sortie contenu dans un ensemble est obtenu par un procédé qui consiste à déterminer une mesure de la corrélation de deux canaux d'entrée au moins et le niveau de rapport entre deux canaux d'entrée au moins.

Claims

Note: Claims are shown in the official language in which they were submitted.





CLAIMS:


1. A method for translating M audio input channels
representing a soundfield to N audio output channels
representing the same soundfield, wherein each channel is a
single audio stream representing audio arriving from a
direction, M and N are positive whole integers, and M is a
positive integer equal to two or more, comprising

a plurality of decoding modular functions, wherein
multiple modular functions share ones of said M input
channels, and each modular function either

includes a matrix that generates one or more
output channels each constituting a subset of said N
channels and controls its matrix in response to two or more
of the closest spatially adjacent cardinal channels
associated with the decoding modular function, or

generates control signals in response to two or
more of the closest spatially adjacent cardinal channels
associated with the decoding modular function, wherein the
control signals are used, along with control signals
generated by other decoding modular functions, to vary the
coefficients of a variable matrix to generate all of the
output channels, or

generates control signals in response to two or
more of the closest spatially adjacent cardinal channels
associated with the decoding modular function, wherein the
control signals are used, along with control signals
generated by other decoder modular functions to vary the
scale factors of inputs to or outputs from a fixed matrix to
generate all of the output channels.




2. A method according to claim 1 wherein the modular
functions are hierarchically ordered according to their
number of input channels and further comprising a
supervisory routine function communicating with the modular
functions to control the sharing of input signals in
accordance with their hierarchical ordering.

Description

Note: Descriptions are shown in the official language in which they were submitted.



DESCRIPTION

Audio Channel Translation
TECHNICAL FIELD
The invention relates to audio signal processing. More particularly the
invention relates to translating M audio input channels representing a
soundfield to N
audio output channels representing the same soundfield, wherein each channel
is a
single audio stream representing audio arriving from a direction, M and N are
positive whole integers, and M is at least 2.


BACKGROUND ART

Although humans have only two ears, we hear sound as a three dimensional
entity, relying upon a number of localization cues, such as head related
transfer
functions (HRTFs) and head motion. Full fidelity sound reproduction therefore

requires the retention and reproduction of the full 3D soundfield, or at least
the
perceptual cues thereof. Unfortunately, sound recording technology is not
oriented
toward capture of the 3D soundfield, nor toward capture of a 2D plane of
sound, nor
even toward capture of a 1D line of sound. Current sound recording technology
is
oriented strictly toward capture, preservation, and presentation of zero
dimensional,
discrete channels of audio.

Most of the effort on improving fidelity since Edison's original invention of
sound recording has focused on ameliorating the imperfections of his original
analog
modulated-groove cylinder/disc media. These imperfections included limited,
uneven frequency response, noise, distortion, wow, flutter, speed accuracy,
wear,

dirt, and copying generation loss. Although there were any number of piecemeal
attempts at isolated improvements, including electronic amplification, tape
recording,
noise reduction, and record players that cost more than some cars, the
traditional
problems of individual channel quality were arguably not finally resolved
until the


singular development of digital recording in general, and specifically the
introduction
of the audio Compact Disc. Since then, aside from some effort at further
extending
the quality of digital recording to 24 bits/96 kHz sampling, the primary
efforts in
audio reproduction research have been focused on reducing the amount of data

needed to maintain individual channel quality, mostly using perceptual coders,
and
on increasing the spatial fidelity. The latter problem is the subject of this
document.
Efforts on improving spatial fidelity have proceeded along two fronts: trying
to convey the perceptual cues of a full sound field, and trying to convey an

approximation to the actual original sound field. Examples of systems
employing the
former approach include binaural recording and two-speaker-based virtual
surround
systems. Such systems exhibit a number of unfortunate imperfections,
especially in
reliably localizing sounds in some directions, and in requiring the use of
headphones
or a fixed single listener position.

For presentation of spatial sound to multiple listeners, whether in a living
room
or a commercial venue like a movie theatre, the only viable alternative has
been to
try to approximate the actual original sound field. Given the discrete channel
nature
of sound recording, it is not surprising that most efforts to date have
involved what
might be termed conservative increases in the number of presentation channels.
Representative systems include the panned-mono three-speaker film soundtracks
of

the early 50's, conventional stereo sound, quadraphonic systems of the 60's,
five
channel discrete magnetic soundtracks on 70mm films, Dolby surround using a
matrix in the 70's, AC-3 5.1 channel sound of the 90's, and recently, Surround-
EX
6.1 channel sound. "Dolby", "Pro Logic" and "Surround EX" are trademarks of
Dolby Laboratories Licensing Corporation. To one degree or another, these
systems
provide enhanced spatial reproduction compared to monophonic presentation.
However, mixing a larger number of channels incurs larger time and cost
penalties
on content producers, and the resulting perception is typically one of a few
scattered,
discrete channels, rather than a continuum soundfield. Aspects of Dolby Pro
Logic


decoding are described in U.S. Patent 4,799,260.
Details of AC-3 are set forth in "Digital Audio

Compression Standard (AC-3)," Advanced Television Systems Committee (ATSC),
Document A/52, December 20, 1995 (available on the World Wide Web of the

Internet at www.atsc.org/Standards/A52/a-52.doc). See also the Errata Sheet of
July
22, 1999 (available on the World Wide Web of the Internet at
www.dolby.com/tech/ATSC-err.pdf).

Insights Underlying Aspects of the Present Invention
The basis for recreating an arbitrary distribution in a source-free wave
medium
is provided by a theorem by Gauss that stipulates that a wave field within
some
region is completely specified by the pressure distribution along the boundary
of the
region. This implies that re-creation of the sound field in a concert hall
within the
confines of a living room is possible by conceptually placing the living room,
walls
impermeable to sound, within the concert hall, then electronically rendering
the walls

sonically transparent by festooning the outside of the walls with an infinite
number of
infinitesimal microphones, each connected with suitable amplification to a
corresponding loudspeaker just inside the wall. By interposing a suitable
recording
medium between microphones and speakers, a complete, if impractical, system of
accurate 3D sound reproduction is realized. The only remaining design task is
to

render the system practical.

A first step toward practicality can be taken by noting the signal of interest
is
bandlimited, at about 20 kHz, permitting the application of the Spatial
Sampling
theorem, a variant of the more common Temporal Sampling theorem. The latter
holds that there is no loss of information if a continuous bandlimited
temporal

waveform is discretely sampled at a rate at least twice the highest frequency
of the
source. The former theorem follows from the same considerations to stipulate
that the
spatial sampling interval must be at least twice as dense as the shortest
wavelength in
order to avoid information loss. Since the wavelength of 20 kHz in air is
about 3/8",


the implication is that an accurate 3D sound system can be implemented with an
array of microphones and loudspeakers spaced no more than 3/16" apart.
Extended
over all surfaces of a typical 9'x12' room, this works out to about 2.5
million
channels, a considerable improvement over an infinite number, but still
impractical at

this time. Still, it establishes the basic approach of using an array of
discrete
channels as spatial samples, from which the sound field can be recovered by
application of appropriate interpolation.
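As a rough check on that figure, the short calculation below reproduces the estimate for a 9' x 12' room; the 8-foot ceiling height is an assumption not stated in the text, so the result is indicative only.

```python
# Rough channel count for spatial sampling at 3/16" spacing over all room surfaces.
# The 8-foot ceiling height is an assumed value; the text gives only the 9' x 12' floor.
spacing_in = 3.0 / 16.0
length_ft, width_ft, height_ft = 12.0, 9.0, 8.0

floor_and_ceiling_sqft = 2 * length_ft * width_ft
walls_sqft = 2 * height_ft * (length_ft + width_ft)
total_area_sqin = (floor_and_ceiling_sqft + walls_sqft) * 144.0

channels = total_area_sqin * (1.0 / spacing_in) ** 2   # one channel per grid square
print(f"{channels / 1e6:.1f} million channels")        # about 2.3 million, of the order quoted
```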

Once the sound field is characterized, it is possible in principle for a
decoder to
derive the optimal signal feed for any output loudspeaker. The channels
supplied to
such a decoder will be referred to herein variously as "cardinal,"
"transmitted," and

"input" channels, and any output channel with a location that does not
correspond to
the position of one of the cardinal channels will be referred to as an
"intermediate"
channel. An output channel may also have a location coincident with the
position of
a cardinal input channel.

It is therefore desirable to reduce the number of discrete channel spatial
samples, or cardinal channels. One possible basis for doing so is the fact
that, above
1500 Hz, the ear no longer follows individual cycles, only the critical band
envelope.
This might allow channel spacing commensurate with 1500 Hz, or about 3". This
would reduce the total for the 9'x12' room to about 6000 channels, a useful
saving of

about 2.49 million channels compared to the previous arrangement.

In any case, further reduction in the number of spatial sampling channels is
theoretically possible by appeal to psychoacoustic localization limits. The
horizontal
limit of resolution, for centered sounds, is about 1 degree of arc. The
corresponding
limit of vertical resolution is about 5 degrees. If this density is extended

appropriately around a sphere, the result will still be a few hundred to a few
thousand
channels.


DISCLOSURE OF THE INVENTION
In accordance with the present invention, a process translates M audio input
channels representing a soundfield to N audio output channels representing the
same
soundfield, wherein each channel is a single audio stream representing audio
arriving

from a direction, M and N are positive whole integers, and M is at least two.
One or
more sets of output channels are generated, each set having one or more output
channels. Each set is associated with two or more spatially adjacent input
channels
and each output channel in a set is generated by a process that includes
determining a
measure of the correlation of the two or more input channels and the level

interrelationships of the two or more input channels.

In one aspect of the present invention, multiple sets of output channels are
associated with more than two input channels and the process determines the
correlation of input channels with which each set of output channels is
associated
according to a hierarchical order such that each set or sets is ranked
according to the

number of input channels with which its output channel or channels are
associated,
the greatest number of input channels having the highest ranking, and the
processing
processes sets in order according to their hierarchical order. Further
according to an
aspect of the present invention, the processing takes into account the results
of

processing higher order sets.

The playback or decoding aspects of the present invention assume that each of
the M audio input channels representing audio arriving from a direction was
generated by a passive-matrix nearest-neighbor amplitude-panned encoding of
each
source direction (i.e., a source direction is assumed to map primarily to the
nearest
cardinal channel or channels), without the requirement of additional side
chain

information (the use of side chain or auxiliary information is optional),
making it
compatible with existing mixing techniques, consoles, and formats. Although
such
source signals may be generated by explicitly employing a passive encoding
matrix,
most conventional recording techniques inherently generate such source signals


(thus, constituting an "effective encoding matrix"). The playback or decoding
aspects of the present invention are also largely compatible with natural
recording
source signals, such as might be made with five real directional microphones,
since,
allowing for some possible time delay, sounds arriving from intermediate
directions

tend to map principally to the nearest microphones (in a horizontal array,
specifically
to the nearest pair of microphones).
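As an illustration of the kind of nearest-neighbor, constant-power amplitude panning assumed here, the sketch below pans a horizontal source direction onto the two nearest of five cardinal channels. The channel azimuths and the sine/cosine panning law are illustrative assumptions, not details taken from the text.

```python
import math

# Assumed horizontal cardinal-channel azimuths in degrees (L, C, R, Rs, Ls).
CARDINALS = [-30.0, 0.0, 30.0, 110.0, -110.0]

def ang_dist(a, b):
    """Smallest absolute angular distance between two azimuths, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def encode_direction(azimuth_deg, gain=1.0):
    """Pan a source at azimuth_deg onto its two nearest cardinal channels with
    constant-power (sin/cos) gains; all other cardinal channels receive zero."""
    a, b = sorted(range(len(CARDINALS)),
                  key=lambda i: ang_dist(azimuth_deg, CARDINALS[i]))[:2]
    d_a = ang_dist(azimuth_deg, CARDINALS[a])
    d_b = ang_dist(azimuth_deg, CARDINALS[b])
    frac = d_a / (d_a + d_b) if (d_a + d_b) else 0.0     # 0 at channel a, 1 at channel b
    gains = [0.0] * len(CARDINALS)
    gains[a] = gain * math.cos(frac * math.pi / 2.0)     # power-preserving pair
    gains[b] = gain * math.sin(frac * math.pi / 2.0)
    return gains

# A source midway between Left (-30) and Centre (0) maps to about 0.707 on each.
print([round(g, 3) for g in encode_direction(-15.0)])
```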

A decoder or decoding process according to aspects of the present invention
may be implemented as a lattice of coupled processing modules or modular
functions
(hereinafter, "decoding modules"), each of which is used to generate one or
more

output channels (or, alternatively, control signals usable to generate one or
more
output channels) from the two or more of the closest spatially adjacent
cardinal
channels associated with the decoding module. The output channels represent
relative proportions of the audio signals in the closest spatially adjacent
cardinal
channels associated with the particular decoding module. As explained in more
detail below, the decoding modules are loosely coupled to each other in the
sense that
modules share nodes and there is a hierarchy of decoding modules. Modules are
ordered in the hierarchy according to the number of cardinal channels they are
associated with (the module or modules with the highest number of associated
cardinal channels is ranked highest). A supervisory routine function presides
over

the modules so that common node signals are equitably shared and higher-order
decoder modules may affect the output of lower-order modules.
Each decoder module may, in effect, include a matrix such that it directly
generates output signals or each decoder module may generate control signals
that
are used, along with the control signals generated by other decoder modules,
to vary

the coefficients of a variable matrix or the scale factors of inputs to or
outputs from a
fixed matrix in order to generate all of the output signals.

Decoder modules emulate the operation of the human ear to attempt to provide
perceptually transparent reproduction. Each decoder module may be implemented
as


either a wideband or multiband structure or function, in the latter case with
either a
continuous filterbank, or a block-structure, for example, a transform-based
processor,
using, for example, the same essential processing in each band.

Although the basic invention relates generally to the spatial translation of M
input channels to N output channels, wherein M and N are positive whole
integers
and M is at least two, another aspect of this invention is that the quantity
of speakers
receiving the N output channels can be reduced to a practical number by
judicious
reliance upon virtual imaging, that is the creation of perceived sonic images
at
positions in space other than where a loudspeaker is located. The most common
use

of virtual imaging is in the stereo reproduction of an image part way between
two
speakers, by panning a mono signal between the channels. Virtual imaging is
not
considered a viable technique for group presentation with a sparse number of
channels, because it requires the listener to be equidistant from the two
speakers, or
nearly so. In movie theatres, for example, the left and right front speakers
are too far

apart to obtain useful phantom imaging of a center image to much of the
audience,
so, given the importance of the center channel as the source of much of the
dialog, a
physical center speaker is used instead.

However, as the density of the speakers is increased, a point will be reached
where virtual imaging is viable between any pair of speakers for much of the

audience, at least to the extent that pans are smooth; with sufficient
speakers, the
gaps between the speakers are no longer perceived as such. Such an array has
the
potential to be nearly indistinguishable from the 2.5 million channel array derived
earlier.

In order to test aspects of the present invention, we deployed a horizontal
array
of 5 speakers on each wall, 16 total allowing for common corner speakers, plus
a ring
of 6 speakers above the listener at a vertical angle of about 45 degrees, plus
a single

speaker directly above, total 23, plus a subwoofer/LFE channel, total 24, all
fed from
a PC set up for 24-channel playback. Although by current parlance this system
might


be referred to as a 23.1 channel system, for simplicity it will be referred to
as a 24-
channel system herein.

FIG. 1 is a top plan view showing schematically an idealized decoding
arrangement in the manner of the just-described test arrangement. Five wide
range
horizontal cardinal channels are shown as squares 1', 3', 5', 9' and 13' on
the outer

circle. A vertical channel, perhaps derived from the five wide range cardinals
via
correlation or generated reverberation, or separately supplied, is shown as
the broken
square 23' in the center. The twenty-three wide range output channels are
shown as
numbered filled circles 1-23. The outer circle of sixteen output channels is
on a

horizontal plane, the inner circle of six output channels is forty-five
degrees above
the horizontal plane. Output channel 23 is directly above one or more
listeners. Five
two-input decoding modules are illustrated as arrows 24-28 around the outer
circle,
connected between each pair of horizontal cardinal channels. Five additional
two-
input vertical decoding modules are illustrated as arrows 29-33 connecting the

vertical channel to each of the horizontal cardinals. Output channel 21, the
elevated
center rear channel, is derived from a three-input decoding module illustrated
as
arrows between output channel 21 and cardinal channels 9, 13 and 23. Thus,
each
module is associated with a respective pair or trio of closest spatially
adjacent
cardinal channels. Although the decoding modules represented in FIG. 1 have
three,

four or five output channels, a decoding module may have any reasonable number
of
output channels. An output channel may be located intermediate to one or more
cardinal channels or at the same position as a cardinal channel. Thus, in the
FIG. 1
example, each of the cardinal channel locations is also an output channel. Two
or
three decoding modules share each input channel.

As will be discussed, a design goal of this invention is that the playback
processor should be capable in concept of working with an arbitrary number and
arrangement of speakers, so the 24-channel array will be used as an
illustrative but


non-unique example of the density and arrangement required to achieve a
convincing
continuum perceived soundfield according to one aspect of the invention.

The desire to be able to use a large, and possibly user-selectable, number of
presentation channels raises the question of the number of discrete channels,
and/or
other information, that must be conveyed to the playback processor in order
for it to
derive, at least as one option, the twenty four channels described above.
Obviously,
one possible approach is simply to transmit twenty four discrete channels, but
aside
from the fact that it would likely be onerous for content producers to have to
mix that
many separate channels, and for a transmission medium to convey as many
channels,

it is preferred not to do so, as the 24-channel arrangement is merely one of
many
possible, and it is desired to allow for more or fewer presentation channels
from a
common transmitted signal array.

One way to recover output channels is to use formal spatial interpolation, a
fixed weighted sum of the transmitted channels for each output, assuming the
density
of such channels is sufficiently great to allow for that. However, this would
require

from thousands to millions of transmitted channels, analogous to the use of a
multi-
hundred-tap FIR filter to perform temporal interpolation of a single signal.
Reduction to a practical number of transmitted channels requires the
application of
psychoacoustic principles and more aggressive, dynamic interpolation from far
fewer

channels, still leaving unanswered the question of just how many channels are
needed to impart the percept of a complete soundfield.

This question was addressed by an experiment performed by the present
inventor some years ago, and recently replicated by another. The basis for the
earlier
experiment, at least, was the observation that conventional 2-channel binaural

recording is capable of reproducing a realistic left/right image spread, but
results in
erratic front/back localization, owing in part to the imperfection of any HRTF
employed, and the lack of head motion cues. To circumvent this drawback, a
dual-
binaural (4-channel) recording was made, using two pairs of directional
microphones


spaced commensurate to the size of the human head. One pair faced forward, the
other to the rear. The resulting recording was played over four speakers
spaced close
to the head, to mitigate acoustic cross coupling effects. This arrangement
provided
realistic left/right timing and amplitude localization cues from each pair of
speakers,

plus unambiguous front/back information from the corresponding discrete
positions
of the microphones and speakers. The result was a singularly compelling
surround
sound presentation that lacked only a viable representation of height
information. A
recent experiment of another added a center front channel and two height
channels,
and was reported to be similarly realistic, perhaps even enhanced by the
addition of
height information.

Therefore, from both psychoacoustic considerations and empirical evidence, it
appears that the relevant perceptual information can be conveyed in perhaps 4
to 5
"binaural-like" horizontal channels, plus perhaps one or more vertical
channels.
However, the signal crossfeed characteristic of binaural channel pairs makes
them

unsuitable for direct playback to a group via loudspeakers, since there is
very little
separation at midrange and low frequencies. So rather than introducing the
crossfeed
at the encoder (as is done for a binaural pair) only to have to undo it in the
decoder, it
is simpler and more direct to keep channels isolated, and to mix output
channel

signals from the nearest transmitted channels. Not only does this allow for
direct
playback through a like number of speakers without a decoder, if desired, plus
optional downmix to fewer channels with a passive matrix decoder, but it
essentially
corresponds to the existing standard arrangement of 5.1 channels, at least in
the
horizontal plane. It is also largely compatible with natural recordings, such
as might
be made with five real directional microphones, since, allowing for some
possible

time delay, sounds arriving from intermediate directions will tend to map
principally
to the nearest microphones (in a horizontal array, specifically to the nearest
pair of
microphones).


Thus, from a perceptual standpoint, it should be possible for a channel

translation decoder to accept a standard 5.1 channel program and convincingly
present it through an arbitrary number of horizontally arrayed speakers,
including the
sixteen horizontal speakers of the twenty-four-channel array described
earlier. With

the addition of a vertical channel, such as is sometimes proposed for a
digital cinema
system, it should be possible to feed the entire twenty-four-channel array
with
individually derived, perceptually valid signals that together impart a
continuum
soundfield percept at most listening positions. Of course, if there is access
to the fine
grain source channels at the encoding site, additional information about them
might

be used to actively alter the encode matrix scale factors to pre-compensate
for
decoder limitations, or might simply be included as additional side-chain
(auxiliary)
information, perhaps similar to the coupling coordinates used in AC-3 (Dolby
Digital) multichannel coding, but perceptually, such extra information should
not be
necessary; and practically, requiring the inclusion of such information is
undesirable.

The intended operation of the channel translation decoder is not limited to
operation
with 5.1 channel sources, and may use fewer or more, but there is at least
some
justification to the belief that credible performance can be obtained from 5.1
channel
sources.

This still leaves unanswered the question of just how to extract the

intermediate output channels from a sparse array of transmitted channels. The
solution proposed by one aspect of the present invention is to exploit again
the notion
of virtual imaging, but in a somewhat different way. It was previously noted
that
virtual imaging is not viable for group presentation with sparse speaker
arrays
because it required the listener to be nearly equidistant from each speaker.
But it will

work, after a fashion, for a listener who is fortuitously so placed, allowing
the percept
of intermediate phantom channels for signals that have been amplitude panned
between the nearest real output channels. It is therefore proposed in one
aspect of the
present invention that the channel translation decoder consist of a series of
modular


interpolating signal processors, each in effect emulating an optimally placed
listener,
and each functioning in a manner analogous to the human auditory system to
extract
what would otherwise be virtual images from amplitude-panned signals, and feed
them to real loudspeakers; the speakers preferably arrayed densely enough that

natural virtual imaging can fill in the remaining gaps between them.

In general, each decoding module derives its inputs from the nearest
transmitted cardinal channels, which, for example, for a canopy (overhead)
array of
speakers may be three or more cardinal channels. One way of generating output
channels involving more than two cardinal channels might be to employ a series
of

pair-wise operations, with, e.g., outputs of some pair-wise decoding modules
feeding
the inputs of other modules. However, this has two drawbacks. One is that
cascading decoding modules introduces multiple cascaded time constants,
resulting
in some output channels responding more quickly than others, causing audible
position artifacts. The second drawback is that pair-wise correlation alone
can only

place intermediate or derived output channels along the line between the pair;
use of
three or more cardinals removes this restriction. Consequently, an extension
to
common pair-wise correlation has been developed to correlate three or more
output
signals; this technique is described below.

Horizontal localization in the human ear is predicated primarily upon two
localization cues: interaural amplitude differences and interaural time
differences.
The latter cue is only valid for signal pairs in near time alignment, 600
microseconds or so. The practical effect is that phantom intermediate images
will
only occur at positions corresponding to a particular left/right amplitude
difference,
assuming the common signal content in the two real channels is correlated, or
nearly

so. (Note: two signals can have cross correlation values that span from +1 to -
1.
Fully correlated signals (correlation = 1) have the same waveform and time
alignment, but may have different amplitudes, corresponding to off-center
image
positions.) As the correlation of a signal pair diminishes below 1, the
perceived



image will tend to spread, until, for two uncorrelated signals, there will be
no
intermediate image, only separate and distinct left and right images. Negative
correlations are usually treated by the ear as similar to uncorrelated signal
pairs,
although the two images may appear to be spread wider. The correlations are
carried

out on a critical band basis, and above about 1500 Hz, the critical band
signal
envelopes are used instead of the signals themselves, to save human
computational
requirements (MIPS).
Vertical localization is a little more complex, relying on HRTF pinna cues and
dynamic modulation of the horizontal cues with head motion, but the final
effect is
similar to horizontal localization with respect to panned amplitudes, cross
correlation, and corresponding perceived image position and fusion. Vertical
spatial
resolution is, however, less precise than horizontal resolution, and does not
require as
dense an array of cardinal channels for adequate interpolation performance.

An advantage of using directional processors that emulate the operation of the
human ear is that any imperfections or limitations of the signal processing
should be
perceptually masked by like imperfections and limitations of the human ear,
allowing
for the possibility that the system will be perceived as nearly
indistinguishable from
the original full continuum presentation.
Although the present invention is designed to make effective use of however
many or few output channels are available (including playback via as many
loudspeakers as there are input channels with no decoding, and passive mixdown
to
fewer channels, including mono, stereo and surround compatible Lt/Rt), it is
preferably intended to employ a large and somewhat arbitrary, but nonetheless
practical number of presentation channels/loudspeakers, and use as source
material a

similar or smaller number of encoded channels, including existing 5.1 channel
surround tracks, and possible next-generation 11- or 12-channel digital cinema
soundtracks.


Implementations of the present invention desirably should exhibit four

principles: error containment, dominant containment, constant power, and
synchronized smoothing.

Error containment refers to the notion that, given the likelihood of decoding
errors, the decoded position of each source should be in some reasonable sense
near
its true, intended direction. This mandates a certain degree of conservatism
in
decoding strategy. Faced with the prospect of more aggressive decoding
accompanied by possibly greater spatial disparity in the event of errors, it
is usually
preferable to accept less precise decoding in exchange for assured spatial

containment. Even in situations in which more precise decoding can confidently
be
applied, it may be unwise to do so if there is a likelihood that dynamic
signal
conditions will require the decoder to ratchet between aggressive and
conservative
modes, resulting in audible artifacts.
Dominant containment, a more constrained variant of error containment, is the
requirement that a single well-defined dominant signal should be panned by the
decoder to only nearest neighbor output channels. This condition is necessary
to
maintain image fusion for dominant signals, and contributes to the perceived
discreteness of a matrix decoder. While a signal is dominant, it is suppressed
from
other output channels, either by subtracting it from the associated cardinal
signals, or

by directly applying to other output channels matrix coefficients
complementary to
those used to derive the dominant signal ("anti-dominant
coefficients/signal").
Constant power decoding requires not only that the total decoded output power

be equal to the input power, but also equates the input/output power of each
channel
and directional signal encoded in the conveyed cardinal array. This minimizes
gain-
pumping artifacts.
Synchronized smoothing applies to systems with signal dependent smoothing
time constants, and requires that if any smoothing network within a decoding
module
is switched to a fast time constant mode, all other smoothing networks within
the


module be similarly switched. This is to avoid having a newly dominant
directional
signal appear to slowly fade/pan from the previous dominant direction.

According to one aspect of the present invention, there is provided a
method for translating M audio input channels representing a soundfield to N
audio output channels representing the same soundfield, wherein each channel
is
a single audio stream representing audio arriving from a direction, M and N
are
positive whole integers, and M is a positive integer equal to two or more,
comprising a plurality of decoding modular functions, wherein multiple modular
functions share ones of said M input channels, and each modular function
either
includes a matrix that generates one or more output channels each constituting
a
subset of said N channels and controls its matrix in response to two or more
of the
closest spatially adjacent cardinal channels associated with the decoding
modular
function, or generates control signals in response to two or more of the
closest
spatially adjacent cardinal channels associated with the decoding modular
function, wherein the control signals are used, along with control signals
generated by other decoding modular functions, to vary the coefficients of a
variable matrix to generate all of the output channels, or generates control
signals
in response to two or more of the closest spatially adjacent cardinal channels
associated with the decoding modular function, wherein the control signals are
used, along with control signals generated by other decoder modular functions
to
vary the scale factors of inputs to or outputs from a fixed matrix to generate
all of
the output channels.


DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic drawing showing a top plan view of an idealized decoder
arrangement.

BEST MODE FOR CARRYING OUT THE INVENTION
Decoding Module

Because encoding any source direction is assumed to map primarily to the
nearest cardinal channels, channel translation decoding is based on a series
of semi-
autonomous decoding modules which in a general sense recover output channels,
particularly intermediate output channels, each usually from a subset of all
the
transmitted channels, in a fashion similar to the human ear.

In a fashion analogous to the human ear, the operation of the decoding module
is based on a combination of amplitude ratios, to determine the nominal
ongoing
primary direction, and cross correlation, to determine the relative width of
the image.
Using control information derived from the amplitude ratios and cross
correlation, the processor then extracts output channel audio signals. Since
this is
best done on a linear basis, to avoid generation of distortion products, the
decoder

forms weighted sums of cardinal channels containing the signal of interest.
(As
explained below, it may also be desirable to include information about non-
neighbor
cardinals in the calculation of the weighted sum.) This limited but dynamic
form of
interpolation is more commonly referred to as matrixing. If, in the source,
the
desired signal is mapped (amplitude panned) to the nearest M cardinal
channels, then
the problem is one of M:N matrix decoding. In other words, the output channels
represent relative proportions of the input channels.
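Viewed this way, the extraction step is simply multiplication by a matrix whose entries are varied by the control information. The sketch below shows the matrixing itself for a two-input module; the five coefficient pairs are the illustrative values used later in this description, held fixed here rather than derived dynamically.

```python
import numpy as np

# Two cardinal (input) channels feeding a two-input decoding module.
rng = np.random.default_rng(0)
cardinals = rng.standard_normal((2, 48000))      # stand-in L/R cardinal signals

# One row of weights per output channel; in a real decoder these coefficients
# would be varied from the amplitude-ratio and cross-correlation control signals.
weights = np.array([[1.00, 0.00],                # output at the left cardinal position
                    [0.89, 0.45],
                    [0.71, 0.71],                # intermediate output positions
                    [0.45, 0.89],
                    [0.00, 1.00]])               # output at the right cardinal position

outputs = weights @ cardinals                    # each output is a weighted sum of the inputs
print(outputs.shape)                             # (5, 48000)
```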


Especially in the case of two-input decoding modules, this is much like the

issue addressed by active 2:N matrix decoders, such as the now classic Dolby
Pro
Logic matrix decoder, with pairwise decoding module inputs corresponding to
the
Lt/Rt encoded signals.

Note: The outputs of a 2:N matrix decoder are sometimes referred to as
cardinal channels. This document, however, uses "cardinal" to refer to the
input
channels of the channel translation decoder.
There is, however, at least one significant difference between prior art
active
2:N decoders and the operation of a decoding module according to the present

invention. While the former use left/right amplitudes to indicate left/right
position,
as postulated as well for the channel translation decoder, they also use
interchannel
phase to indicate front/back position, relying specifically on the ratio of
sum/difference of the Lt/Rt encoded channels.
There are two problems with such active 2:N decoder arrangements. One is
that fully correlated (frontal), but off-center signals, for example, will
result in a
sum/difference ratio of less than infinity, incorrectly indicating a less-than-
full-
frontal position (similarly for fully anti-correlated off-center rear signals).
The result
is a somewhat warped decoding space. The second drawback is that the
positional
mapping is many-to-one, introducing inherent decoding errors. For example, in
a

4:2:4 matrix system, an uncorrelated Left-In and Right-In signal pair with no
Front-
In or Rear-In will map to the same net, uncorrelated Lt/Rt pair as will an
uncorrelated
Front-In/Back-In pair, with no Left-In/Right-In, or for that matter from all
four inputs
uncorrelated. The decoder, faced with an uncorrelated Lt/Rt signal pair, has
no

choice but to "relax the matrix", that is use a passive matrix that
distributes sound to
all output channels. It is incapable of decoding to a simultaneous Left-
Out/Right-Out
only, or Front-Out/Rear-Out only signal array.
The underlying problem is that the use of interchannel phase to code
front/back position in N:2:N matrixing systems runs counter to the operation
of the


human ear, which does not use phase to judge front/back position. The present
invention works best with at least three non-collinear cardinal channels, so
that
front/back position is indicated by the assumed directions of the cardinal
channels,
without assigning different directions depending on their relative phases or
polarities.

As such, a pair of uncorrelated or anti-correlated channel translation
cardinal signals
unambiguously decodes to isolated cardinal-output channel signals, with no
intermediate signal and no "rearward" direction indicated. (This, by the way,
avoids
the unfortunate "center pileup" effect in active 2:N decoders, in which
uncorrelated
Left-In and Right-In signals are presented with reduced separation because the

decoder feeds sum and difference of these signals to center and surround
channels.)
Of course, it is possible in principle to spatially expand a Lt/Rt signal pair
by
cascading a 2:N decoder, N = 4, or 5, with an N:M channel translation system,
but in
that case any limitations of the 2:N decoder, such as center pileup, will be
carried
over to the channel multiplied outputs. It is also possible to combine these
functions

into a channel translation decoder configured to accept 2-channel Lt/Rt
signals and,
in such cases, modify its behavior to interpret negative correlation signals
as having
rearward orientation, leaving the rest of the processing largely intact.
However, even
in that case, decoding ambiguities resulting from having only two transmitted

channels would remain.

Thus, each decoding module, especially those with two input channels,
resembles a prior art active 2:N decoder, with the front/back detection
disabled or
modified, and an arbitrary number of output channels. Of course, it is a
mathematical impossibility to use matrixing to uniquely extract a larger
number of
channels from a smaller number, as this basically involves N linear equations
with M

unknowns, M greater than N. Therefore, it is to be expected that the decoding
module may at times exhibit less than perfect channel recovery in the presence
of
multiple active source direction signals. However, the human auditory system,
limited to using just two ears, will tend to be subject to the same
limitations, allowing


the system to be perceived as discrete, even with all channels operating.
Isolated
channel quality, with other channels muted, is still a consideration to
accommodate
listeners that may be situated near one speaker.

To be sure, the ear is operating on a frequency-dependent basis, but given
that
most sonic images will be similarly correlated at all frequencies, together
with the
successful empirical experience with Pro Logic decoders as a wideband system,
it is
to be expected that a wideband channel translation system may also be capable
of
satisfactory performance in some applications. Multiband channel translation
decoding should also be possible, using similar processing on a band-by-band
basis,

and using the same encoded signal in each case, so the number and bandwidth of
individual bands can be left as a free parameter to the decoder implementer.
Although multiband processing is likely to require higher MIPS than wideband
processing, the computational demands may not be that much higher if the input
signals are divided into data blocks and the process is carried out on a block
basis.

Before describing an algorithm usable by the decoding modules of the present
invention, consideration is first given to the problem of shared nodes.

Shared Nodes
If the cardinal channel groups used by the decoding modules were all
independent, then the decoding modules themselves could be independent,

autonomous entities. Such is, however, not usually the case. A given
transmitted
channel will in general share separate output signals with two or more
neighboring
cardinal channels. If independent decoding modules are used to decode the
array,
each will be influenced by output signals of neighboring channels, resulting
in
possibly serious decoding errors. In effect, two output signals of neighboring

decoding modules will "pull", or gravitate, toward each other, because of the
increased level of the common cardinal node containing both signals. If, as is
likely
to be the case, the signals are dynamic, so too will be the amount of
interaction,
leading to signal dependent dynamic positioning errors of a possibly highly


objectionable nature. This problem does not arise with Pro Logic and other
active
2:N decoding, since they use only a single, isolated channel pair as the
decoder input.

Thus, it is necessary to compensate for the "shared node" effect. One possible
way to do so would be to subtract one recovered signal from the common node
before trying to recover the output signal of an adjacent decoding module
sharing the
common node. This is often not possible, so as a fall back, each decoding
module
estimates the amount of common output signal energy present at its input
channels,
and a supervisor routine then informs each module of its neighbors' output
signal
energy estimates.

Pair-wise calculation of common energy
For example, suppose cardinal channel pair A/B contains a common signal X
along with individual, uncorrelated signals Y and Z:

A = 0.707X+Y
B = 0.707X + Z

where the scale factors of 0.707 = √0.5 provide a power-preserving mapping to the
the
nearest neighbor cardinal channels.

RMS Energy(A) = ∫A² dt = (0.707X + Y)² = 0.5X² + 1.414XY + Y²

Because X and Y are uncorrelated,
XY=0

So:


A² = 0.5X² + Y²

i.e., because X and Y are uncorrelated, the total energy in cardinal channel A
is the
sum of the energies of signals X and Y.


Similarly:
B² = 0.5X² + Z²

Since X, Y, and Z are uncorrelated, the averaged cross-product of A and B is:
AB = 0.5X²

So, in the case of an output signal shared equally by two neighboring cardinal
channels which may also contain independent, uncorrelated signals, the
averaged
cross-product of the signals is equal to the energy of the common signal
component

in each channel. If the common signal is not shared equally, i.e., it is
panned toward
one of the cardinals, the averaged cross-product will be the geometric mean
between
the energy of the common components in A and B, from which individual channel
common energy estimates can be derived by normalizing by the square root of
the
ratio of the channel amplitudes. Actual time averages are computed with a
leaky

integrator having a suitable decay time constant, to reflect ongoing activity.
The time
constant smoothing can be elaborated with nonlinear attack and decay time
options,
and in a multiband system, may be scaled with frequency.
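A minimal numerical sketch of this pair-wise estimate is given below. The synthetic signals and the single 0.999 decay constant of the leaky integrator are illustrative assumptions; as noted above, a practical decoder may use separate attack and decay behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)              # signal common to both cardinal channels
y = rng.standard_normal(n)              # independent signal in A
z = rng.standard_normal(n)              # independent signal in B

a = 0.707 * x + y                       # cardinal channel A
b = 0.707 * x + z                       # cardinal channel B

def leaky_average(signal, decay=0.999):
    """Running time average from a one-pole leaky integrator."""
    avg = 0.0
    out = np.empty_like(signal)
    for i, s in enumerate(signal):
        avg = decay * avg + (1.0 - decay) * s
        out[i] = avg
    return out

cross = leaky_average(a * b)            # averaged cross-product AB
energy_a = leaky_average(a * a)         # averaged A^2
energy_b = leaky_average(b * b)         # averaged B^2

# With the common signal shared equally, AB settles near the common energy
# 0.5 * X^2 in each channel (about 0.5 here), while A^2 and B^2 settle near 1.5.
print(round(cross[-1], 2), round(energy_a[-1], 2), round(energy_b[-1], 2))
```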

Higher order calculation of common energy
In order to derive the common energy of decoding modules with three or more
inputs, it is necessary to form averaged cross-products of all the input
signals.


Simply performing pairwise processing of the inputs will fail to differentiate
between
separate output signals between each pair of inputs and a signal common to
all.

Consider, for example, three cardinal channels, A, B, and C, made up of
uncorrelated signals W, Y, Z, and common signal X:

A=X+W
B=X+Y
C=X+Z
If the average cross-product is calculated, all terms involving combinations
of

W, Y, and Z will cancel, as in the second order calculation, leaving the
average of
X³:

ABC = X³
Unfortunately, if X is a zero mean time signal, as expected, then the average
of
its cube is zero. Unlike averaging X², which is positive for any nonzero value
of X,

X³ has the same sign as X, so the positive and negative contributions will
tend to
cancel. Obviously, the same holds for any odd power of X, corresponding to an
odd
number of module inputs, but even exponents greater than 2 can also lead to
erroneous results; for example four inputs with components (X, X, -X, -X) will
have
the same product/average as (X, X, X, X).

This problem has been resolved by employing a variant of the averaged
product technique. Before being averaged, the sign of each product is
discarded
by taking the absolute value of the product. The signs of each term of the
product are
examined. If they are all the same, the absolute value of the product is
applied to the
averager. If any of the signs are different from the others, the negative of
the

absolute value of the product is averaged. Since the number of possible same-
sign
combinations may not be the same as the number of possible different-sign


combinations, a weighting factor comprised of the ratio of the number of same
to
different sign combinations is applied to the negated absolute value products
to
compensate. For example, a three-input module has two ways for the signs to be
the
same, out of eight possibilities, leaving six possible ways for the signs to
be different,

resulting in a scale factor of 2/6 = 1/3. This compensation causes the
integrated or
summed product to grow in a positive direction if and only if there is a
signal
component common to all inputs of a decoding module.

However, in order for the averages of different order modules to be
comparable, they must all have the same dimensions. A conventional second-
order
correlation involves averages of two-input multiplications and hence of
quantities

with the dimensions of energy or power. Thus the terms to be averaged in
higher
order correlations must be modified also to have the dimensions of power. For
a kth
order correlation, the individual product absolute values must therefore be
raised to
the power 2/k before being averaged.

Of course, regardless of the order, the individual input node energies of a
module, if needed, can be calculated as the average of the square of the
corresponding node signal, and need not be first raised to the kth power and
then
reduced to a second order quantity.
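The sketch below implements this sign-compensated, dimension-normalized product average for a k-input module; the weighting of the different-sign terms follows the counting of same-sign versus different-sign combinations described above, and the test signals are illustrative.

```python
import numpy as np

def higher_order_common_energy(inputs):
    """Sign-compensated averaged product for a k-input decoding module.

    Per-sample products are reduced to the dimensions of power by raising their
    absolute values to 2/k.  Same-sign samples are added; different-sign samples
    are subtracted after weighting by (same-sign combos)/(different-sign combos),
    so the average grows positive only when a component is common to all inputs."""
    inputs = np.asarray(inputs, dtype=float)
    k = inputs.shape[0]
    magnitude = np.abs(np.prod(inputs, axis=0)) ** (2.0 / k)
    signs = np.sign(inputs)
    same_sign = np.all(signs == signs[0], axis=0)
    weight = 2.0 / (2.0 ** k - 2.0)                  # e.g. 2/6 = 1/3 for k = 3
    return np.where(same_sign, magnitude, -weight * magnitude).mean()

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal(n)                           # component common to A, B and C
w, y, z = rng.standard_normal((3, n))                # independent components

print(round(higher_order_common_energy([x + w, x + y, x + z]), 2))  # clearly positive
print(round(higher_order_common_energy([w, y, z]), 2))              # near zero
```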


Shared nodes: Neighbor levels

By using averaged squares and modified cross-products of cardinal channel
signals, the amount of common output channel signal energy can be estimated.
The
above example involved a single interpolation processor, but if one or more of
the

A/B(/C) nodes were common to another module with its own common signal
component, uncorrelated with any other signals, then the averaged cross-
product
computed above would not be affected, making the calculation inherently free
of any
image pulling effects. (Note: if the two output signals are not uncorrelated,
they will
tend to pull the decoders some, but should have a similar effect on the human
ear, so

again system operation should remain faithful to human audition.)

Once each decoding module has computed the estimated common output
channel signal energy at each of its cardinal nodes, the supervisor routine
function
can inform neighboring modules of each others' common energy, at which point
the
extraction of the output channel signals can proceed as described below. The

calculation of the common energy used by a module at a node must take into
account
the hierarchy of possibly overlapping modules of different order, and subtract
the
common energy of a higher order module from the estimated common energy of any
lower order module sharing the same nodes.

For example, suppose there are two adjacent cardinal channels A and B,
representing two horizontal directions, plus a cardinal channel C,
representing a
vertical direction, and further suppose the existence of an intermediate or
derived
output channel representing an interior direction (i.e. one within the limits
of A, B
and C), with signal energy X². The common energy of a three-input module, with
inputs (A, B, C), will be X², but so will the common energy of the two-input
modules (A, B), (B, C), and (A, C). If the common energy of the A-connected
modules (A, B, C), (A, B), and (A, C) is simply added, the result is 3X²,
instead of X². In order for
the
calculation of common node energy to be correct, the common energy of each
higher
order module is first subtracted from the estimate of the common energy of
each


overlapping lower-level module, so the common energy X² of the higher order module
(A, B, C) is subtracted from the common energy estimates of the two two-input
modules, resulting in 0 in each case, and making the net common energy estimate at
node A equal to X² + 0 + 0 = X².
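A small sketch of that bookkeeping for the example above; the idealized common-energy values and the dictionary layout are illustrative, and a real decoder would of course work from the running estimates rather than exact figures.

```python
# Idealized common-energy estimates: an interior signal of energy X2 appears
# common to the three-input module and to every two-input module sharing its nodes.
X2 = 1.0
common = {
    ("A", "B", "C"): X2,
    ("A", "B"): X2,
    ("A", "C"): X2,
    ("B", "C"): X2,
}

# Work from the highest-order module downward, subtracting its common energy
# from every lower-order module whose nodes it covers.
modules = sorted(common, key=len, reverse=True)
for i, high in enumerate(modules):
    for low in modules[i + 1:]:
        if set(low) <= set(high):
            common[low] = max(common[low] - common[high], 0.0)

# Net common energy claimed at node A: X2 + 0 + 0 = X2, as in the text.
node_a = sum(energy for module, energy in common.items() if "A" in module)
print(common, node_a)
```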

Output Channel Signal Extraction
As has been noted, the process of recovering the ensemble of output channels
from the transmitted channels in a linear fashion is basically one of
matrixing, that is
forming weighted sums of the cardinal channels to derive output channel
signals.
The optimal choice of matrix scale factors is generally signal dependent.
Indeed, if

the number of currently active output channels is equal to the number of
transmitted
channels (but representing different directions), making the system exactly
constrained, it is mathematically possible to compute an exact inverse of the
effective
encoding matrix, and recover isolated versions of the source signals. Even if
the
number of active output channels is greater than the number of cardinals, it
may still

be possible to compute a matrix pseudo-inverse.

Unfortunately, there are problems with this approach, not the least of which
is
that it is computationally demanding, especially on a multiband basis, and
oriented
toward high accuracy floating point implementation. Even though intermediate
signals are assumed to be panned to nearest neighbor cardinal channels, a

mathematical inverse or pseudo-inverse of the effective encoding matrix will
in
general involve contributions from all cardinal channels to each output
channel,
because of the node sharing effect. If there are any imperfections in the
decoding, as
indeed there inevitably will be, a cardinal channel signal could be reproduced
from
an output channel far removed from it spatially, which is highly undesirable.
In

addition, pseudo-inverse calculations tend to produce minimum-RMS-energy
solutions, which maximally spread the sound around, providing minimum
separation;
this is quite the opposite of the intention.
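A short numerical illustration of that spreading effect, using an assumed 3-cardinal, 5-output effective encoding matrix (the coefficients are illustrative): decoding a single panned source with the matrix pseudo-inverse returns the minimum-RMS-energy solution, which leaks the source into every output direction instead of confining it to the intended one.

```python
import numpy as np

# Assumed effective encoding matrix: 3 cardinal channels (rows) carrying
# 5 output directions (columns), each panned onto its two nearest cardinals.
E = np.array([[1.00, 0.71, 0.00, 0.00, 0.00],
              [0.00, 0.71, 1.00, 0.71, 0.00],
              [0.00, 0.00, 0.00, 0.71, 1.00]])

source = np.zeros(5)
source[1] = 1.0                          # a single source at the second output direction
cardinals = E @ source                   # what the decoder actually receives

decoded = np.linalg.pinv(E) @ cardinals  # minimum-RMS-energy (pseudo-inverse) decode
print(np.round(decoded, 2))              # gain appears on every output, not just the second
```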


So, in order to implement a practical, fault-tolerant decoder in which spatial

decoding errors are inherently contained, the same modular structure as was
used for
signal detection is employed for signal extraction.

Following are details of the extraction process by which output signals are
recovered by a decoding module. Note that the effective position of each
output
channel connected to the module is assumed to be indicated by the amplitude
ratio
that would otherwise be needed to pan a signal to that physical location,
i.e., the ratio
of the effective matrix encoding coefficients corresponding to that direction.
To
avoid divide-by-zero problems, ratios are typically calculated as the quotient
of one

channel's matrix coefficient over the RMS sum of all of that channel's
matrix
coefficients (usually 1). For example, in a two-input module with inputs L and
R, the
energy ratio used would be the L energy over the sum of the L and R energies
("L-
ratio"), which has a well-behaved range of 0 to 1. If the two-input decoding
module
has five output channels with effective encoding matrix coefficient pairs of
(1.0, 0),

(0.89, 0.45), (0.71, 0.71), (0.45, 0.89) and (0, 1.0), the corresponding L-
ratios are 1.0,
0.89, 0.71, 0.45, and 0, since each scale factor pair has an RMS sum of 1Ø
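The L-ratio calculation for these five output channels can be sketched as follows (illustrative Python only; the helper name l_ratio is hypothetical).

    import math

    # Encoding matrix coefficient pairs (L, R) for the five output channels of
    # the two-input decoding module in the example above.
    pairs = [(1.0, 0.0), (0.89, 0.45), (0.71, 0.71), (0.45, 0.89), (0.0, 1.0)]

    def l_ratio(l, r):
        # Quotient of the L coefficient over the RMS sum of the pair; dividing
        # by the RMS sum (about 1.0 here) rather than by the other coefficient
        # avoids divide-by-zero problems.
        return l / math.hypot(l, r)

    ratios = [round(l_ratio(l, r), 2) for (l, r) in pairs]
    # ratios == [1.0, 0.89, 0.71, 0.45, 0.0]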

From the signal energy of each input node (cardinal channel) of the decoding module is subtracted any node-sharing signal energy claimed by neighboring decoding modules, resulting in normalized input signal power levels used for the remainder of the calculation.

The dominant direction indicator is calculated as the vector sum of the cardinal directions, weighted by the relative energy. For a two-input module, this simplifies to the L-ratio of the normalized input signal power levels.

The output channels bracketing the dominant direction are determined by comparing the dominant direction L-ratio from the preceding step to the L-ratios of the output channels. For example, if the L-ratio of the above five-output decoding module's inputs is 0.75, the second and third output channels bracket the dominant signal direction, since 0.89 > 0.75 > 0.71.
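A corresponding sketch of the bracketing test, reusing the L-ratios above, is given below (illustrative only; the value 0.75 is the hypothetical dominant-direction L-ratio).

    # Output-channel L-ratios from the previous example, ordered from full left
    # to full right, and a hypothetical dominant-direction L-ratio of 0.75.
    channel_ratios = [1.0, 0.89, 0.71, 0.45, 0.0]
    dominant = 0.75

    bracket = None
    for i in range(len(channel_ratios) - 1):
        if channel_ratios[i] >= dominant >= channel_ratios[i + 1]:
            bracket = (i, i + 1)     # indices of the bracketing output channels
            break
    # bracket == (1, 2): the second and third channels, since 0.89 > 0.75 > 0.71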


Panning scale factors to map the dominant signal onto the nearest bracketing channels are calculated from the ratio of the anti-dominant signal levels of the channels. The anti-dominant signal associated with a particular output channel is the signal that results when the corresponding decoding module's input signals are matrixed with the output channel's anti-dominant matrix scale factors. An output channel's anti-dominant matrix scale factors are those scale factors with RMS sum = 1.0 which result in zero output when a single dominant signal is panned to the output channel in question. If an output channel's encode matrix scale factors are (A, B), then the anti-dominant scale factors of the channel are just (B, -A).

Proof
If a single dominant signal is panned to an output channel with encode scale factors (A, B), then the signal must have amplitudes (kA, kB), where k is the overall amplitude of the signal. The anti-dominant signal for that channel is then (kA*B - kB*A) = 0.
So, if a dominant signal consists of two-input module input signals (x(t), y(t)) with input amplitudes (X, Y) normalized to RMS = 1, the extracted dominant signal will be dom(t) = Xx(t) + Yy(t). If the position of this signal is bracketed by output channels having matrix scale factors (A, B) and (C, D) respectively, the dominant signal scale factor scaling dom(t) for the former channel will be:

SF(A,B) = sqrt ((DX - CY) / ((DX - CY) + (BX - AY))),

while the equivalent dominant signal scale factor for the latter channel will be:

SF(C,D) = sqrt ((BX - AY) / ((DX - CY) + (BX - AY))).
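These two expressions may be checked with a short sketch (illustrative Python; the function name and the test coefficients are assumptions, reusing values from the earlier five-channel example).

    import math

    def panning_scale_factors(A, B, C, D, X, Y):
        # Anti-dominant levels of the two bracketing channels with encode
        # coefficients (A, B) and (C, D), for a dominant direction (X, Y).
        anti_ab = B * X - A * Y
        anti_cd = D * X - C * Y
        total = anti_cd + anti_ab
        sf_ab = math.sqrt(anti_cd / total)   # SF(A,B), largest when (X, Y) lies on (A, B)
        sf_cd = math.sqrt(anti_ab / total)   # SF(C,D), largest when (X, Y) lies on (C, D)
        return sf_ab, sf_cd

    # Hypothetical check: dominant direction panned exactly onto channel (A, B).
    sf1, sf2 = panning_scale_factors(0.89, 0.45, 0.71, 0.71, 0.89, 0.45)
    # sf1 is 1.0 and sf2 is 0.0; the power sum sf1**2 + sf2**2 is always 1.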


As the dominant direction is panned from one output channel to the other, these two scale factors move in opposite directions between zero and one with constant power sum.

The anti-dominant signal is calculated and panned with suitable gain scaling
to
all non-dominant channels. The anti-dominant signal is a matrixed signal
lacking
any of the dominant signal. If the inputs to a decoding module are (x(t),
y(t)) with
normalized amplitudes (X, Y), the dominant signal is Xx(t)+Yy(t) and the anti-
dominant signal is Yx(t)-Xy(t), irrespective of the positions of the non-
dominant
output channels.

In addition to the dominant/anti-dominant signal distribution, a second signal
distribution is calculated, using the "passive" matrix, which is basically the
output
channel matrix scale factors already discussed, scaled to preserve power.

The cross correlation of the decoding module input signals is calculated as
the
averaged cross-product of the input signals divided by the square root of the
product
of the normalized input levels.
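A direct reading of that calculation, assuming block-based averaging over hypothetical sample arrays x and y, is sketched below (illustrative only).

    import numpy as np

    def module_correlation(x, y):
        # x, y: blocks of samples from the two decoding-module inputs.
        cross = np.mean(x * y)                       # averaged cross-product
        level_x = np.mean(x * x)                     # normalized input levels
        level_y = np.mean(y * y)
        return cross / (np.sqrt(level_x * level_y) + 1e-12)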

Returning to details of the extraction process, the final output signals are then calculated as a weighted crossfade sum of the dominant and passive signal distributions, using the decoding module's input signal cross-correlation to derive the crossfade factor. For correlation = 1, the dominant/anti-dominant distribution is used exclusively. As the correlation diminishes, the output signal array is broadened by cross-fading to the passive distribution, reaching completion at a low positive value of correlation, typically 0.2 to 0.4, depending on the number of output channels connected to the decoding module. As the correlation falls further, toward zero, the passive amplitude output distribution is progressively bowed outward, reducing the output channel levels, emulating the response of the human ear to such signals.
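One possible sketch of this crossfade is given below; the breakpoint of 0.3 (within the 0.2 to 0.4 range mentioned above) and the particular outward-bowing law are assumptions made purely for illustration.

    import numpy as np

    def output_distribution(dominant_dist, passive_dist, correlation, breakpoint=0.3):
        # dominant_dist, passive_dist: per-output-channel amplitude arrays.
        corr = float(np.clip(correlation, 0.0, 1.0))
        if corr >= breakpoint:
            # Crossfade from purely dominant/anti-dominant (corr = 1) toward
            # the passive distribution, reaching it at the breakpoint.
            w = (corr - breakpoint) / (1.0 - breakpoint)
            return w * dominant_dist + (1.0 - w) * passive_dist
        # Below the breakpoint, progressively bow the passive distribution
        # outward, reducing levels as correlation falls toward zero
        # (assumed law; the floor value is arbitrary).
        bow = np.sqrt(0.5 + 0.5 * corr / breakpoint)
        return passive_dist * bow
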
Vertical Processing
Most of the processing described so far applies to the extraction of output channel signals from neighboring cardinal channels, regardless of the direction of the output and cardinal channels. However, because of the horizontal orientation of the ears, human auditory localization tends to be less sensitive to interchannel correlation in the vertical direction than horizontally. To remain faithful to the operation of the human ear, it may be desirable to relax the correlation constraint in interpolation processors using vertically-oriented input channels, such as by processing the correlation signal with a warping function before otherwise applying it. However, it may be that using the same processing as for horizontal channels will not involve any audible penalty, which would simplify the structure of the overall decoder.

Strictly speaking, vertical information includes both sound from above and below, and the decoder structure described will work equally well with either, but in practice there is little natural sound normally perceived as coming from below, so such processing and channels can probably be omitted without seriously compromising the perceived spatial fidelity of the system.

That notion may have practical significance in the application of channel translation to existing 5.1 channel surround material, which, of course, lacks any vertical channel. However, it may contain vertical information, such as fly-overs, which are panned across many or all of the horizontal channels. Thus, it should be possible to extract a virtual vertical channel from such source material, by looking for correlations among non-neighboring channels or groups of channels. Where such correlations exist, they will usually indicate the presence of vertical information from above, rather than below, the listener. In some instances, it may also be possible to derive virtual vertical information from a reverberation generator, perhaps keyed to a model of the intended listening environment. Once the virtual vertical channel is extracted or derived from the 5.1-channel source, the expansion to larger numbers of channels, such as the 24-channel arrangement described earlier, can proceed as if a real vertical channel had been supplied.


Directional Memory

One respect in which the operation of the decoding module control generation
described above is similar to a 2:N active decoder such as a Pro Logic decoder
is that
the only "memory" in the process is in the smoothing networks, which derive
the
basic control signals. At any one point in time, there is only one dominant
direction
and one value of input correlation; and signal extraction proceeds directly
from these
signals.

However, particularly in complex acoustical environments (like the archetypal cocktail party), the human ear exhibits a certain degree of positional memory, or inertia, in that a briefly dominant sound from a given direction that is clearly localized will cause other, less distinctly localizable sounds from that general direction to be perceived as coming from the same source.

It is possible to emulate this effect in the decoding modules (and, indeed, in Pro Logic decoding as well) by adding an explicit mechanism to keep track of recently dominant directions and, during intervals of directionally ambiguous signal conditions, weight the output signal distribution toward recently dominant directions. This can enhance the perceived discreteness and stability of reproduced complex signal arrays.
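One conceivable realization of such a mechanism is sketched below; the patent does not prescribe a particular law, so the decay factor, the clarity measure and the weighting used here are all assumptions for illustration.

    class DirectionalMemory:
        # Illustrative only: remembers a recently dominant direction and biases
        # the decoder toward it when the current estimate is ambiguous.
        def __init__(self, decay=0.995):
            self.decay = decay          # per-block leak factor (assumed value)
            self.direction = 0.5        # remembered L-ratio, 0.5 = centered
            self.strength = 0.0         # confidence in the remembered direction

        def update(self, l_ratio, clarity):
            # clarity: a 0..1 measure of how clearly localized the current
            # block is (e.g., derived from correlation and dominance); assumed.
            if clarity > self.strength:
                self.direction = l_ratio
                self.strength = clarity
            else:
                self.strength *= self.decay

        def effective_direction(self, l_ratio, clarity):
            # Weight the output toward the remembered direction during
            # directionally ambiguous conditions.
            w = max(self.strength - clarity, 0.0)
            return (1.0 - w) * l_ratio + w * self.direction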

Modified Correlation and Selective Channel Mixing

As described, the spreading determination of each decoding module is based on the coincident cross correlation of its input signals. This may underestimate the amount of output signal content under some conditions. This will occur, for example, with a naturally recorded signal in which non-centered directions have slightly different arrival times, along with unequal amplitudes, resulting in a reduced correlation value. The effect may be exaggerated if wide-spaced microphones are used, with commensurately elongated interchannel delays. To compensate, the correlation calculation can be extended to cover a range of interchannel time delays, at the expense of slightly higher processing MIPS requirements. Also, since the neurons on the auditory nerve have an effective time constant of about 1 msec., more realistic correlation values may be obtained by first smoothing the rectified audio with a smoother having a 1 msec. time constant.
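Both refinements might be realized along the following lines (illustrative Python; the +/- 10 msec. search range and the first-order smoother are assumptions).

    import numpy as np

    def smoothed(x, fs, tau=0.001):
        # First-order smoothing of the rectified signal with a time constant of
        # about 1 msec., emulating the auditory-nerve response mentioned above.
        a = np.exp(-1.0 / (tau * fs))
        y = np.empty(len(x))
        acc = 0.0
        for i, v in enumerate(np.abs(np.asarray(x, dtype=float))):
            acc = a * acc + (1.0 - a) * v
            y[i] = acc
        return y

    def max_lagged_correlation(x, y, fs, max_delay=0.010):
        # Extend the correlation calculation over a range of interchannel
        # delays and keep the largest value found.
        xs, ys = smoothed(x, fs), smoothed(y, fs)
        max_lag = int(max_delay * fs)
        best = 0.0
        for lag in range(-max_lag, max_lag + 1):
            a = xs[max(lag, 0): len(xs) + min(lag, 0)]
            b = ys[max(-lag, 0): len(ys) + min(-lag, 0)]
            denom = np.sqrt(np.mean(a * a) * np.mean(b * b)) + 1e-12
            best = max(best, np.mean(a * b) / denom)
        return best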

In addition, if a content producer has an existing 5.1 channel program with strongly uncorrelated channels, the evenness of the spread when processed with a channel translation decoder can be increased by slightly mixing adjacent channels, thereby increasing the correlation, which will cause the channel translation decoding module to provide a more even spread among its intermediate output channels. Such mixing can be done selectively, for example leaving the center front channel signal unmixed, to preserve the compactness of the dialog track.
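Such selective mixing could be sketched as follows; the 10% mix amount, the channel names and the choice of adjacent pairs are assumptions for illustration only.

    def mix_adjacent(ch, amount=0.1, skip=("C",)):
        # ch: dict of channel name -> list of samples for a 5.1 program
        # (hypothetical names "L", "R", "C", "Ls", "Rs").  Adjacent pairs are
        # lightly cross-mixed to raise their correlation; any channel named in
        # 'skip' (here the center/dialog channel) is left unmixed.
        adjacent = [("L", "Ls"), ("R", "Rs"), ("Ls", "Rs")]
        out = {name: list(sig) for name, sig in ch.items()}
        for a, b in adjacent:
            if a in skip or b in skip:
                continue
            for i in range(len(out[a])):
                xa, xb = out[a][i], out[b][i]
                out[a][i] = (1.0 - amount) * xa + amount * xb
                out[b][i] = (1.0 - amount) * xb + amount * xa
        return out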
Loudness Compression/Expansion
When the encoding process involves mixing a larger number of channels to a smaller number, there is a potential for clipping of the encoded signal if some form of gain compensation is not provided. This problem exists as well for conventional matrix encoding, but is potentially of greater significance for channel translation, because the number of channels being mixed to a given output channel is greater. To avoid clipping in such cases, an overall gain scale factor is derived by the encoder and conveyed in the encoded bitstream to the decoder. Normally, this value is 0 dB, but can be set to a nonzero attenuating value by the encoder to avoid clipping, with the decoder providing an equivalent amount of compensating gain.
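The encode-side attenuation and decode-side compensation might be sketched as follows (illustrative; the function names are hypothetical, and only the standard dB conversion is assumed).

    import math

    def encoder_gain_db(downmix_peak, full_scale=1.0):
        # Overall gain scale factor conveyed in the encoded bitstream: 0 dB
        # normally, or a negative (attenuating) value chosen so the encoded
        # downmix does not clip.
        if downmix_peak <= full_scale:
            return 0.0
        return -20.0 * math.log10(downmix_peak / full_scale)

    def apply_db(samples, db):
        g = 10.0 ** (db / 20.0)
        return [s * g for s in samples]

    # Encoder attenuates by the conveyed amount; the decoder applies the
    # equivalent compensating gain:
    #   encoded = apply_db(downmix, gain_db)
    #   decoded = apply_db(translated, -gain_db)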

If the decoder is used to process an existing multichannel program that lacks such a scale factor (e.g., an existing 5.1 channel soundtrack), it could optionally use a fixed scale factor with an assumed value (presumably 0 dB), or apply an expansion function based on signal level and/or dynamics, or possibly make use of available metadata, such as a dialog normalization value, to adjust the decoder gain.
The present invention and its various aspects may be implemented in analog circuitry, or more probably as software functions performed in digital signal processors, programmed general-purpose digital computers, and/or special purpose digital computers. Interfaces between analog and digital signal streams may be performed in appropriate hardware and/or as functions in software and/or firmware.

Representative Drawing
A single figure which represents the drawing illustrating the invention.
Administrative Status

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer, as well as the definitions for Patent, Administrative Status, Maintenance Fee and Payment History, should be consulted.


Title Date
Forecasted Issue Date 2012-04-10
(86) PCT Filing Date 2002-02-07
(87) PCT Publication Date 2002-08-15
(85) National Entry 2003-08-05
Examination Requested 2007-02-07
(45) Issued 2012-04-10
Expired 2022-02-07

Abandonment History

There is no abandonment history.

Payment History

Fee Type Anniversary Year Due Date Amount Paid Paid Date
Registration of a document - section 124 $100.00 2003-08-05
Application Fee $300.00 2003-08-05
Maintenance Fee - Application - New Act 2 2004-02-09 $100.00 2004-01-07
Maintenance Fee - Application - New Act 3 2005-02-07 $100.00 2005-01-13
Maintenance Fee - Application - New Act 4 2006-02-07 $100.00 2006-01-10
Maintenance Fee - Application - New Act 5 2007-02-07 $200.00 2007-01-05
Request for Examination $800.00 2007-02-07
Maintenance Fee - Application - New Act 6 2008-02-07 $200.00 2008-01-08
Maintenance Fee - Application - New Act 7 2009-02-09 $200.00 2009-01-28
Maintenance Fee - Application - New Act 8 2010-02-08 $200.00 2010-01-20
Maintenance Fee - Application - New Act 9 2011-02-07 $200.00 2011-01-18
Final Fee $300.00 2011-12-22
Maintenance Fee - Application - New Act 10 2012-02-07 $250.00 2012-01-18
Maintenance Fee - Patent - New Act 11 2013-02-07 $250.00 2013-01-17
Maintenance Fee - Patent - New Act 12 2014-02-07 $250.00 2014-01-17
Maintenance Fee - Patent - New Act 13 2015-02-09 $250.00 2015-02-02
Maintenance Fee - Patent - New Act 14 2016-02-08 $250.00 2016-02-01
Maintenance Fee - Patent - New Act 15 2017-02-07 $450.00 2017-02-06
Maintenance Fee - Patent - New Act 16 2018-02-07 $450.00 2018-02-05
Maintenance Fee - Patent - New Act 17 2019-02-07 $450.00 2019-02-04
Maintenance Fee - Patent - New Act 18 2020-02-07 $450.00 2020-01-22
Maintenance Fee - Patent - New Act 19 2021-02-08 $459.00 2021-01-20
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
DOLBY LABORATORIES LICENSING CORPORATION
Past Owners on Record
DAVIS, MARK FRANKLIN
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents



Document Description   Date (yyyy-mm-dd)   Number of pages   Size of Image (KB)
Abstract 2003-08-05 2 65
Claims 2003-08-05 2 53
Drawings 2003-08-05 1 15
Description 2003-08-05 31 1,493
Representative Drawing 2003-08-05 1 14
Cover Page 2003-10-06 1 41
Claims 2004-02-17 3 106
Claims 2007-02-07 2 48
Description 2011-01-21 32 1,529
Representative Drawing 2012-03-13 1 8
Cover Page 2012-03-13 2 43
PCT 2003-08-05 1 27
Assignment 2003-08-05 7 338
PCT 2003-08-06 7 298
Prosecution-Amendment 2004-02-17 4 132
Prosecution-Amendment 2007-02-07 4 94
Prosecution-Amendment 2010-12-14 2 40
Prosecution-Amendment 2011-01-21 5 200
Correspondence 2011-12-22 2 60