Note: Descriptions are shown in the official language in which they were submitted.
CA 02527971 2005-12-01
WO 2005/059899 PCT/SE2004/001867
FIDELITY-OPTIMISED VARIABLE FRAME LENGTH ENCODING
TECHNICAL FIELD
The present invention relates in general to encoding of audio signals, and in
particular to encoding of multi-channel audio signals.
BACKGROUND
There is a high market need to transmit and store audio signals at low bit
rate while maintaining high audio quality. Particularly in cases where
transmission resources or storage are limited, low bit rate operation is an
essential cost factor. This is typically the case, e.g. in streaming and
messaging applications in mobile communication systems such as GSM,
UMTS, or CDMA.
Today, there are no standardised codecs available providing high
stereophonic audio quality at bit rates that are economically interesting for
use in mobile communication systems. What is possible with available
codecs is monophonic transmission of the audio signals. To some extent also
stereophonic transmission is available. However, bit rate limitations usually
require limiting the stereo representation quite drastically.
The simplest way of stereophonic or multi-channel coding of audio signals is
to encode the signals of the different channels separately as individual and
independent signals. Another basic way used in stereo FM radio
transmission and which ensures compatibility with legacy mono radio
receivers is to transmit a sum and a difference signal of the two involved
channels.
State-of-the-art audio codecs, such as MPEG-1/2 Layer III and MPEG-2/4
AAC, make use of so-called joint stereo coding. According to this technique,
the signals of the different channels are processed jointly, rather than
separately and individually. The two most commonly used joint stereo coding
techniques are known as "Mid/Side" (M/S) stereo coding and intensity stereo
coding, which usually are applied on sub-bands of the stereo or multi-
channel signals to be encoded.
M/S stereo coding is similar to the described procedure in stereo FM radio,
in the sense that it encodes and transmits the sum and difference signals of
the channel sub-bands and thereby exploits redundancy between the
channel sub-bands. The structure and operation of an encoder based on
M/S stereo coding is described, e.g. in US patent 5,285,498 by J.D.
Johnston.
Intensity stereo, on the other hand, is able to make use of stereo irrelevancy.
It transmits the joint intensity of the channels (of the different sub-bands)
along with some location information indicating how the intensity is
distributed among the channels. Intensity stereo provides only spectral
magnitude information of the channels. Phase information is not conveyed.
For this reason, and since the temporal inter-channel information (more
specifically the inter-channel time difference) is of major psycho-acoustical
relevancy particularly at lower frequencies, intensity stereo can only be
used at high frequencies above e.g. 2 kHz. An intensity stereo coding method is
described, e.g. in the European patent 0497413 by R. Veldhuis et al.
A recently developed stereo coding method is described, e.g. in a conference
paper with the title "Binaural cue coding applied to stereo and multi-channel
audio compression", 112th AES convention, May 2002, Munich, Germany by
C. Faller et al. This method is a parametric multi-channel audio coding
method. The basic principle is that at the encoding side, the input signals
from N channels c1, c2, ..., cN are combined to one mono signal m. The mono
signal is audio encoded using any conventional monophonic audio codec. In
parallel, parameters are derived from the channel signals, which describe the
multi-channel image. The parameters are encoded and transmitted to the
decoder, along with the audio bit stream. The decoder first decodes the mono
signal m' and then regenerates the channel signals c1', c2', ..., cN', based on
the parametric description of the multi-channel image.
The principle of the Binaural Cue Coding (BCC) method is that it transmits
the encoded mono signal and so-called BCC parameters. The BCC
parameters comprise coded inter-channel level differences and inter-channel
time differences for sub-bands of the original multi-channel input signal.
The decoder regenerates the different channel signals by applying sub-band-
wise level and phase adjustments of the mono signal based on the BCC
parameters. The advantage over e.g. M/S or intensity stereo is that stereo
information comprising temporal inter-channel information is transmitted at
much lower bit rates. However, this technique requires computationally
demanding time-frequency transforms on each of the channels, both at the
encoder and the decoder.
Moreover, BCC does not handle the fact that a lot of the stereo information,
especially at low frequencies, is diffuse, i.e. it does not come from any
specific direction. Diffuse sound fields exist in both channels of a stereo
recording but they are to a great extent out of phase with respect to each
other. If an algorithm such as BCC is subjected to recordings with a great
amount of diffuse sound fields the reproduced stereo image will become
confused, jumping from left to right as the BCC algorithm can only pan the
signal in specific frequency bands to the left or right.
A possible means to encode the stereo signal and ensure good reproduction
of diffuse sound fields is to use an encoding scheme very similar to the
technique used in FM stereo radio broadcast, namely to encode the mono
(Left+Right) and the difference (Left-Right) signals separately.
A technique described in US patent 5,434,948 by C.E. Holt et al. uses an
approach similar to BCC for encoding the mono signal and side
information. In this case, side information consists of predictor filters and
optionally a residual signal. The predictor filters, estimated by a least-mean-
square algorithm, when applied to the mono signal allow the prediction of
the multi-channel audio signals. With this technique one is able to reach
very low bit rate encoding of multi-channel audio sources, however, at the
expense of a quality drop, discussed further below.
Finally, for completeness, a technique is to be mentioned that is used in 3D
audio. This technique synthesises the right and left channel signals by
filtering sound source signals with so-called head-related filters. However,
this technique requires the different sound source signals to be separated
and can thus not generally be applied for stereo or multi-channel coding.
SUMMARY
A problem with existing encoding schemes based on encoding of frames of
signals, in particular a main signal and one or more side signals, is that the
division of audio information into frames may introduce unattractive
perceptual artefacts. Dividing the information into frames of relatively long
duration generally reduces the average requested bit rate. This may be
beneficial e.g. for music containing a large amount of diffuse sound.
However, for transient-rich music or speech, the fast temporal variations will
be smeared out over the frame duration, giving rise to ghost-like sounds or
even pre-echoing problems. Encoding short frames will instead give a more
accurate representation of the sound, minimising the energy, but requires
higher transmission bit rates and higher computational resources. The
coding efficiency as such may also decrease with very short frame lengths.
The introduction of more frame boundaries may also introduce
discontinuities in encoding parameters, which may appear as perceptual
artefacts.
A further problem with schemes based on encoding of a main and one or
several side signals is that they often require relatively large computational
resources. In particular when short frames are used, handling
discontinuities in parameters from one frame to another is a complex task.
When long frames are used, estimation errors of transient sound
may cause very large side signals, in turn increasing the
transmission rate demand.
An object of the present invention is therefore to provide an
encoding method and device improving the perception quality of
multi-channel audio signals, in particular to avoid artefacts
such as pre-echoing, ghost-like sounds or frame discontinuity
artefacts. A further object of the present invention is to
provide an encoding method and device requiring less processing
power and having more constant transmission bit rate
requirements.
The above objects are achieved by methods and devices according
to the enclosed patent claims. In general words, polyphonic
signals are used to create a main signal, typically a mono
signal, and a side signal. The main signal is encoded according
to prior-art encoding principles. A number of encoding schemes
for the side signal are provided. Each encoding scheme is
characterised by a set of sub-frames of different lengths. The
total length of the sub-frames corresponds to the length of the
encoding frame of the encoding scheme. The sets of sub-frames
comprise at least one sub-frame. The encoding scheme to be used
on the side signal is selected at least partly dependent on the
present signal content of the polyphonic signals.
In one embodiment, the selection takes place, before the
encoding, based on signal characteristics analysis. In another
embodiment, the side signal is encoded by each of the encoding
schemes, and based on measurements of the quality of the
encoding, the best encoding scheme is selected.
In a preferred embodiment, a side residual signal is created as
the difference between the side signal and the main signal scaled
with a balance factor. The balance factor is selected to minimise
the side residual signal. The optimised side residual signal and
the balance factor are encoded and provided as parameters
representing the side signal. At the decoder side, the
balance factor, the side residual signal and the main signal are used to
recover the side signal.
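
The balance-factor idea can be illustrated with a minimal sketch. The text only states that the balance factor is selected to minimise the side residual signal; the least-squares criterion used here, and the function names balance_factor, encode_side and decode_side, are assumptions made for illustration. A real decoder would also use the decoded main signal rather than the original one.

```python
def balance_factor(side, main):
    """Gain g that minimises sum((side - g * main) ** 2) (assumed criterion)."""
    energy = sum(m * m for m in main)
    if energy == 0.0:
        return 0.0  # silent main signal: nothing to predict from
    return sum(s * m for s, m in zip(side, main)) / energy


def encode_side(side, main):
    """Return the balance factor and the side residual for one frame."""
    g = balance_factor(side, main)
    residual = [s - g * m for s, m in zip(side, main)]
    return g, residual


def decode_side(g, residual, main):
    """Recover the side signal from balance factor, residual and main signal."""
    return [r + g * m for r, m in zip(residual, main)]
```

When the side signal is an exact scaled copy of the main signal, the residual vanishes and only the balance factor carries information, which is the case where this representation is cheapest.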
In a further preferred embodiment, the encoding of the side signal comprises
an energy contour scaling in order to avoid pre-echoing effects. Furthermore,
different encoding schemes may comprise different encoding procedures in
the separate sub-frames.
According to an aspect of the present invention there is
provided a method of encoding multi-channel audio signals,
comprising the steps of:
generating a first output signal;
said first output signal being encoding parameters
representing a main signal;
providing at least two encoding schemes, each of the at
least two encoding schemes comprising a respective set of
sub-frames together constituting an encoding frame, wherein
the sum of the lengths of the sub-frames in each encoding
scheme is equal to the length of the encoding frame, each
set of sub-frames comprising at least one sub-frame;
generating a second output signal within all sub-frames
of the respective set of sub-frames of each of the at least
two encoding schemes separately;
said second output signal being encoding parameters
representing a side signal within the encoding frame;
said main signal being a first linear combination of
signals of at least a first and a second channel;
said side signal being a second linear combination of
signals of at least the first and the second channel;
said step of generating the second output signal
comprises the step of selecting an encoding scheme from the
at least two encoding schemes at least in part dependent
on the present signal content of the side signal by
calculating a total fidelity measure for each of the at
least two encoding schemes and selecting an encoded signal
from the encoding scheme having the best fidelity measure
as the encoding parameters representing the side signal.
According to another aspect of the present invention there
is provided a method of decoding multi-channel audio
signals, the method comprising the steps of:
decoding encoding parameters representing a main signal
into a decoded main signal;
said main signal being a first linear combination of
signals of at least a first and a second channel;
decoding encoding parameters representing a side signal
within an encoding frame into a decoded side signal;
said side signal being a second linear combination of
signals of at least the first and the second channel; and
combining at least the decoded main signal and the
decoded side signal in linear combinations into said
signals of at least said first and said second channel,
providing at least two encoding schemes, each of the at
least two encoding schemes comprising a respective set of
sub-frames together constituting the encoding frame,
wherein, the sum of the lengths of the sub-frames in each
encoding scheme is equal to the length of the encoding
frame;
wherein each set of sub-frames comprises at least one
sub-frame,
wherein the step of decoding the encoding parameters
representing the side signal in turn comprises the step of
decoding the encoding parameters representing the side
signal separately in the sub-frames of one of the at least
two encoding schemes.
According to a further aspect of the present invention
there is provided an encoder apparatus, comprising:
input means for multi-channel audio signals comprising at
least a first and a second channel; and
means for generating a first output signal;
said first output signal being encoding parameters
representing a main signal;
means for providing at least two encoding schemes, each
of the at least two encoding schemes comprising a
respective set of sub-frames together constituting an
encoding frame, wherein the sum of the lengths of the sub-
frames in each encoding scheme is equal to the length of
the encoding frame;
each set of sub-frames comprising at least one sub-frame;
means for generating a second output signal within all
sub-frames of the respective sets of sub-frames of each of
the at least two encoding schemes separately;
said second output signal being encoding parameters
representing a side signal within the encoding frame;
said main signal being a first linear combination of at
least a first and a second channel;
said side signal being a second linear combination of at
least the first and the second channel;
said means for generating the second output signal in
turn comprising means for selecting an encoding scheme from
the at least two encoding schemes at least in part
dependent on the present signal content of the side signal
by calculating a total fidelity measure for each of the at
least two encoding schemes and selecting an encoded signal
from the encoding scheme having the best fidelity measure
as the encoding parameters representing the side signal;
and
output means being arranged for outputting said first
output signal and said second output signal.
According to a further aspect of the present invention
there is provided a decoder apparatus, comprising:
input means for encoding parameters representing a main
signal and encoding parameters representing a side signal;
means for decoding the encoding parameters representing
the main signal into a decoded main signal;
said main signal being a first linear combination of
signals of at least a first and a second channel;
means for decoding the encoding parameters representing
the side signal within an encoding frame into a decoded
side signal;
said side signal being a second linear combination of
signals of at least the first and the second channel;
means for combining at least the decoded main signal and
the decoded side signal in linear combinations into signals
of at least a first and a second channel; and
output means arranged for outputting said signals of said
first and second channels,
wherein the means for decoding the encoding parameters
representing the side signal in turn comprises:
means for providing at least two encoding schemes,
each of the at least two encoding schemes comprising a
respective set of sub-frames together constituting the
encoding frame, wherein the sum of the lengths of the
sub-frames in each encoding scheme is equal to the
length of the encoding frame;
each set of sub-frames comprising at least one sub-
frame; and
means for decoding the encoding parameters
representing the side signal separately in the sub-
frames of one of the at least two encoding schemes.
According to a further aspect of the present invention
there is provided an audio system comprising at least one of:
an encoder apparatus as described herein; or
a decoder apparatus as described herein; or both.
According to one aspect of the invention there is provided
a method of encoding multi-channel audio signals,
comprising the steps of:
generating a first output signal being encoding
parameters representing a main signal;
said main signal being a first linear combination of
signals of at least a first and a second channel; and
generating a second output signal being encoding
parameters representing a side signal;
said side signal being a second linear combination of
signals of at least the first and the second channel within
an encoding frame,
wherein the step of generating the second output signal
further comprises the step of:
scaling the side signal to an energy contour of the
main signal.
According to a further aspect of the invention there is
provided a method of decoding multi-channel audio signals,
comprising the steps of:
generating a decoded main signal from encoding parameters
representing a main signal;
said main signal being a first linear combination of
signals of at least a first and a second channel;
generating a decoded side signal from encoding parameters
representing a side signal;
said side signal being a second linear combination of
signals of at least a first and a second channel within an
encoding frame; and
combining at least the decoded main signal and the
decoded side signal into signals of at least said first and
said second channel,
wherein the step of generating a decoded side signal
further comprises the step of:
scaling the decoded side signal to an energy contour
of the decoded main signal.
According to another aspect of the invention there is
provided an encoder apparatus, comprising:
input means for multi-channel audio signals comprising at
least a first and a second channel,
means for generating a first output signal being encoding
parameters representing a main signal;
said main signal being a first linear combination of
signals of at least the first and the second channel;
means for generating a second output signal being
encoding parameters representing a side signal;
said side signal being a second linear combination of
signals of at least the first and the second channel,
within an encoding frame; and
output means;
wherein the means for generating a second output signal
further comprises:
means for scaling the side signal to an energy
contour of the main signal.
According to yet another aspect of the invention there is
provided a decoder apparatus, comprising:
input means for encoding parameters representing a main
signal and encoding parameters representing a side signal;
said main signal being a first linear combination of a
first and a second channel;
said side signal being a second linear combination of a
first and a second channel;
means for generating a decoded main signal from the
encoding parameters representing the main signal;
means for generating a decoded side signal from the
encoding parameters representing the side signal within an
encoding frame;
means for combining at least the decoded main signal and
the decoded side signal into signals of at least a first
and a second channel; and
output means,
wherein the means for generating a decoded side signal in
turn comprises:
means for scaling the decoded side signal to an
energy contour of the decoded main signal.
The main advantage with the present invention is that the preservation of
the perception of the audio signals is improved. Furthermore, the present
invention still allows multi-channel signal transmission at very low bit
rates.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention, together with further objects and advantages thereof, may best
be understood by making reference to the following description taken together
with the accompanying drawings, in which:
FIG. 1 is a block scheme of a system for transmitting polyphonic signals;
FIG. 2a is a block diagram of an encoder in a transmitter;
FIG. 2b is a block diagram of a decoder in a receiver;
FIG. 3a is a diagram illustrating encoding frames of different lengths;
FIGS. 3b and 3c are block diagrams of embodiments of side signal encoder
units according to the present invention;
FIG. 4 is a block diagram of an embodiment of an encoder using balance
factor encoding of side signal;
FIG. 5 is a block diagram of an embodiment of an encoder for multi-signal
FIG. 6 is a block diagram of an embodiment of a decoder suitable for
decoding signals from the device of Fig. 5;
FIGS. 7a and 7b are diagrams illustrating a pre-echo artefact;
FIG. 8 is a block diagram of an embodiment of a side signal encoder unit
according to the present invention, employing different encoding principles in
different sub-frames;
FIG. 9 illustrates the use of different encoding principles in different
frequency sub-bands;
FIG. 10 is a flow diagram of the basic steps of an embodiment of an
encoding method according to the present invention; and
FIG. 11 is a flow diagram of the basic steps of an embodiment of a
decoding method according to the present invention.
DETAILED DESCRIPTION
Fig. 1 illustrates a typical system 1, in which the present invention
advantageously can be utilised. A transmitter 10 comprises an antenna 12
including associated hardware and software to be able to transmit radio
signals 5 to a receiver 20. The transmitter 10 comprises among other parts a
multi-channel encoder 14, which transforms signals of a number of input
channels 16 into output signals suitable for radio transmission. Examples of
suitable multi-channel encoders 14 are described in detail further below.
The signals of the input channels 16 can be provided from e.g. an audio
signal storage 18, such as a data file of digital representation of audio
recordings, magnetic tape or vinyl disc recordings of audio etc. The signals of
the input channels 16 can also be provided "live", e.g. from a set of
microphones 19. The audio signals are digitised, if not already in digital
form, before entering the multi-channel encoder 14.
At the receiver 20 side, an antenna 22 with associated hardware and
software handles the actual reception of radio signals 5 representing
polyphonic audio signals. Here, typical functionalities, such as e.g. error
correction, are performed. A decoder 24 decodes the received radio signals 5
and transforms the audio data carried thereby into signals of a number of
output channels 26. The output signals can be provided to e.g. loudspeakers
29 for immediate presentation, or can be stored in an audio signal storage
28 of any kind.
The system 1 can for instance be a phone conference system, a system for
supplying audio services or other audio applications. In some systems, such
as e.g. the phone conference system, the communication has to be of a
duplex type, while e.g. distribution of music from a service provider to a
subscriber can be essentially of a one-way type. The transmission of signals
from the transmitter 10 to the receiver 20 can also be performed by any
other means, e.g. by different kinds of electromagnetic waves, cables or
fibres as well as combinations thereof.
Fig. 2a illustrates an embodiment of an encoder according to the present
invention. In this embodiment, the polyphonic signal is a stereo signal
comprising two channels a and b, received at input 16A and 16B,
respectively. The signals of channel a and b are provided to a pre-processing
unit 32, where different signal conditioning procedures may be performed.
The (perhaps modified) signals from the output of the pre-processing unit 32
are summed in an addition unit 34. This addition unit 34 also divides the
sum by a factor of two. The signal x_mono produced in this way is a main
signal of the stereo signals, since it basically comprises all data from both
channels. In this embodiment the main signal thus represents a pure
"mono" signal. The main signal x_mono is provided to a main signal encoder
unit 38, which encodes the main signal according to any suitable encoding
principles. Such principles are available within prior-art and are thus not
further discussed here. The main signal encoder unit 38 gives an output
signal p_mono, being encoding parameters representing a main signal.
In a subtraction unit 36, a difference (divided by a factor of two) of the
channel signals is provided as a side signal x_side. In this embodiment, the
side signal represents the difference between the two channels in the stereo
signal. The side signal x_side is provided to a side signal encoding unit 30.
Preferred embodiments of the side signal encoding unit 30 will be discussed
further below. According to a side signal encoding procedure, which will be
described in more detail further below, the side signal x_side is transferred into
encoding parameters p_side representing a side signal x_side. In certain
embodiments, this encoding takes place utilising also information
of the main signal x_mono. The arrow 42 indicates such a provision,
where the original uncoded main signal x_mono is utilised. In
further other embodiments, the main signal information that is
used in the side signal encoding unit 30 can be deduced from the
encoding parameters p_mono representing the main signal, as
indicated by the broken line 44.
The encoding parameters p_mono representing the main signal x_mono are
a first output signal, and the encoding parameters p_side
representing the side signal x_side are a second output signal. In a
typical case, these two output signals p_mono, p_side, together
representing the full stereo sound, are multiplexed into one
transmission signal 52 in a multiplexor unit 40. However, in
other embodiments, the transmission of the first and second
output signals p_mono, p_side may take place separately.
In Fig. 2b, an embodiment of a decoder 24 according to the
present invention is illustrated as a block scheme. The received
signal 54, comprising encoding parameters representing the main
and side signal information, is provided to a demultiplexor unit
56, which separates a first and second input signal,
respectively. The first input signal, corresponding to encoding
parameters p_mono of a main signal, is provided to a main signal
decoder unit 64. In a conventional manner, the encoding
parameters p_mono representing the main signal are used to generate
a decoded main signal x"_mono, being as similar to the main signal
x_mono (Fig. 2a) of the encoder 14 (Fig. 2a) as possible.
Similarly, the second input signal, corresponding to a side
signal, is provided to a side signal decoder unit 60. Here, the
encoding parameters p_side representing the side signal are used to
recover a decoded side signal x"_side. In some embodiments, the
decoding procedure utilises information about the main signal
x"_mono, as indicated by an arrow.
The decoded main and side signals x"_mono, x"_side are provided to an
addition unit 70, which provides an output signal that is a
representation of the
original signal of channel a. Similarly, a difference provided by
a subtraction unit 68 provides an output signal that is a
representation of the original signal of channel b. These channel
signals may be post-processed in a post-processor unit 74
according to prior-art signal processing procedures. Finally, the
channel signals a and b are provided at the outputs 26A and 26B
of the decoder.
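
The down-mix of Fig. 2a and the reconstruction of Fig. 2b can be sketched as follows. This is a minimal illustration that ignores pre-processing, the actual main and side signal coders and post-processing, so the decoded signals are assumed to equal the originals; the function names downmix and upmix are hypothetical.

```python
def downmix(a, b):
    """Form the main (mono) and side signals: half sum and half difference."""
    x_mono = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    x_side = [(ai - bi) / 2 for ai, bi in zip(a, b)]
    return x_mono, x_side


def upmix(x_mono, x_side):
    """Recover channel a as the sum and channel b as the difference."""
    a = [m + s for m, s in zip(x_mono, x_side)]
    b = [m - s for m, s in zip(x_mono, x_side)]
    return a, b
```

Since (a + b)/2 + (a - b)/2 = a and (a + b)/2 - (a - b)/2 = b, the round trip is lossless in the absence of coding errors; any distortion therefore stems from the main and side signal coders.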
As mentioned in the summary, encoding is typically performed
one frame at a time. A frame comprises audio samples within a
pre-defined time period. In the bottom part of Fig. 3a, a frame
SF2 of time duration L is illustrated. The audio samples within
the unhatched portion are to be encoded together. The preceding
samples and the subsequent samples are encoded in other frames.
The division of the samples into frames will in any case
introduce some discontinuities at the frame borders. Shifting
sounds will give shifting encoding parameters, changing basically
at each frame border. This will give rise to perceptible errors.
One way to compensate somewhat for this is to base the encoding,
not only on the samples that are to be encoded, but also on
samples in the absolute vicinity of the frame, as indicated by
the hatched portions. In such a way, there will be a softer
transfer between the different frames. As an alternative, or
complement, interpolation techniques are sometimes also utilised
for reducing perception artefacts caused by frame borders.
However, all such procedures require large additional
computational resources, and for certain specific encoding
techniques, it might also be difficult to provide it with any
resources.
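
As an illustration of basing the encoding on neighbouring samples as well, the following sketch splits a sample sequence into frames and attaches a margin of context on each side, corresponding to the hatched portions of Fig. 3a. The function name frames_with_context and the margin parameter are assumptions; the text does not specify how large the vicinity is, and trailing samples that do not fill a whole frame are simply dropped here.

```python
def frames_with_context(samples, frame_len, margin):
    """Yield (start, core, extended) for consecutive non-overlapping frames.

    core     -- the frame_len samples that are actually encoded in this frame
    extended -- core plus up to `margin` neighbouring samples on each side
    """
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        lo = max(0, start - margin)
        hi = min(len(samples), start + frame_len + margin)
        yield start, samples[start:start + frame_len], samples[lo:hi]
```

Each sample belongs to the core of exactly one frame, while the extended view overlaps the neighbouring frames, which is what allows a softer transfer between them.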
In this view, it is beneficial to utilise as long frames as
possible, since the number of frame borders will be small. Also
the coding efficiency typically becomes high and the necessary
transmission bit-rate will typically be minimised. However, long
frames give problems with pre-echo artefacts and ghost-like
sounds.
By instead utilising shorter frames, such as SF1 or even SF0,
having the durations of L/2 and L/4, respectively, anyone skilled
in the art realises that
the coding efficiency may be decreased, the transmission bit-rate
may have to be higher and the problems with frame border
artefacts will increase. However, shorter frames suffer less from
e.g. other perception artefacts, such as ghost-like sounds and
pre-echoing. In order to be able to minimise the coding error as
much as possible, one should use as short a frame length as
possible.
According to the present invention, the audio perception will be
improved by using a frame length for encoding of the side signal
that is dependent on the present signal content. Since the
influence of different frame lengths on the audio perception will
differ depending on the nature of the sound to be encoded, an
improvement can be obtained by letting the nature of the signal
itself affect the frame length that is used. The encoding of the
main signal is not the object of the present invention and is
therefore not described in detail. However, the frame lengths
used for the main signal may or may not be equal to the frame
lengths used for the side signal.
Due to small temporal variations, it may e.g. in some cases be
beneficial to encode the side signal with use of relatively long
frames. This may be the case with recordings with a great amount
of diffuse sound field such as concert recordings. In other
cases, such as stereo speech conversation, short frames are
probably preferable. The decision on which frame length to prefer
can be made in two basic ways.
One embodiment of a side signal encoder unit 30 according to the
present invention is illustrated in Fig. 3b, in which a closed
loop decision is utilised. A basic encoding frame of length L is
used here. A number of encoding schemes 81, characterised by a
separate set 80 of sub-frames, are created. Each set 80 of sub-
frames comprises one or more sub-frames of equal or differing
lengths. The total length of the set 80 of sub-frames is,
however, always equal to the basic encoding frame length L. With
reference to Fig. 3b, the top encoding scheme is characterised
by a set of sub-frames comprising only one sub-frame of length L.
The next set of
frames comprises two frames of length L/2. The third set comprises two
frames of length L/4 followed by an L/2 frame.
The signal xside provided to the side signal encoder unit 30 is encoded by all
encoding schemes 81. In the top encoding scheme, the entire basic encoding
frame is encoded in one piece. In the other encoding schemes, however, the
signal xside is encoded in each sub-frame separately. The
result from each encoding scheme is provided to a selector 85. A fidelity
measurement means 83 determines a fidelity measure for each of the
encoded signals. The fidelity measure is an objective quality value, preferably
a signal-to-noise measure or a weighted signal-to-noise ratio. The fidelity
measures associated with each encoding scheme are compared and the
result controls a switching means 87 to select the encoding parameters
representing the side signal from the encoding scheme giving the best fidelity
measure as the output signal pside from the side signal encoder unit 30.
Preferably, all possible combinations of frame lengths are tested and the set
of sub-frames that gives the best objective quality, e.g. signal-to-noise
ratio, is selected.
In the present embodiment, the lengths of the sub-frames used are selected
according to:
$$ l_{sf} = \frac{l_f}{2^n}, $$
where l_sf are the lengths of the sub-frames, l_f is the length of the encoding
frame and n is an integer. In the present embodiment, n is selected between 0
and 3. However, any frame lengths will be possible to use as long as the total
length of the set is kept constant.
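The closed loop selection described above may be sketched as follows. This is only a minimal illustration under stated assumptions: the function names are hypothetical, the frame length is assumed to be divisible by 2^3, the fidelity measure is a plain (unweighted) signal-to-noise ratio, and toy_codec is a stand-in that replaces each sub-frame by its mean value, not the encoder of the invention.

```python
import math

def subframe_sets(total, allowed):
    """Yield every ordered list of allowed sub-frame lengths summing to `total`."""
    if total == 0:
        yield []
        return
    for length in allowed:
        if length <= total:
            for rest in subframe_sets(total - length, allowed):
                yield [length] + rest

def snr_db(target, decoded):
    """Plain signal-to-noise ratio in dB, used here as the fidelity measure."""
    signal = sum(x * x for x in target)
    noise = sum((x - y) ** 2 for x, y in zip(target, decoded))
    if noise == 0.0:
        return math.inf
    if signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)

def toy_codec(block):
    """Hypothetical stand-in codec: keep only the mean of the sub-frame."""
    mean = sum(block) / len(block)
    return [mean] * len(block)

def closed_loop_select(frame, codec=toy_codec, max_n=3):
    """Encode the frame with every sub-frame set and keep the best-scoring one."""
    l_f = len(frame)
    allowed = [l_f // 2 ** n for n in range(max_n + 1)]  # l_sf = l_f / 2^n
    best = None
    for lengths in subframe_sets(l_f, allowed):
        decoded, start = [], 0
        for length in lengths:
            decoded.extend(codec(frame[start:start + length]))
            start += length
        score = snr_db(frame, decoded)
        if best is None or score > best[0]:
            best = (score, lengths, decoded)
    return best

score, lengths, _ = closed_loop_select([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
```

With a frame of 8 samples the search covers 56 ordered sets of sub-frames, which illustrates why the computational requirements grow quickly when many encoding schemes are investigated.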
In Fig. 3c, another embodiment of a side signal encoder unit 30 according to
the present invention is illustrated. Here, the frame length decision is an
open loop decision, based on the statistics of the signal. In other words, the
spectral characteristics of the side signal are used as a basis for deciding
which encoding scheme is going to be used. As before, different
encoding schemes characterised by different sets of sub-frames are
available. However, in this embodiment, the selector 85 is placed before the
actual encoding. The input side signal xside enters the selector 85 and a
signal analysing unit 84. The result of the analysis becomes the input of a
switch 86, in which only one of the encoding schemes 81 is utilised. The
output from that encoding scheme will also be the output signal pside from
the side signal encoder unit 30.
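An open loop decision may be sketched as below. The analysis rule is purely an assumption made for illustration: it compares block energies and treats a large energy ratio as a transient, whereas the description only states that the statistics or spectral characteristics of the side signal are analysed. The function names and the threshold value are hypothetical, and the three candidate sets of sub-frames are those named for Fig. 3b.

```python
def _energy(block):
    return sum(x * x for x in block)

def _is_transient(e_a, e_b, ratio):
    """True if one block carries more than `ratio` times the energy of the other."""
    low, high = sorted((e_a, e_b))
    return high > ratio * max(low, 1e-12)

def choose_subframes_open_loop(frame, ratio=4.0):
    """Pick one set of sub-frames from signal statistics, before any encoding.

    Assumes len(frame) is divisible by 4.
    """
    full = len(frame)
    half, quarter = full // 2, full // 4
    # A transient inside the first half calls for the finest resolution there.
    if _is_transient(_energy(frame[:quarter]), _energy(frame[quarter:half]), ratio):
        return [quarter, quarter, half]
    # A transient between the two halves calls for two half-length sub-frames.
    if _is_transient(_energy(frame[:half]), _energy(frame[half:]), ratio):
        return [half, half]
    # Otherwise the frame is treated as stationary and encoded in one piece.
    return [full]
```

Only the selected scheme is then run, so a single encoding is performed per frame.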
The advantage of an open loop decision is that only one actual encoding
has to be performed. The disadvantage is, however, that the analysis of the
signal characteristics may be very complicated indeed and it may be difficult
to predict possible behaviours in advance to be able to give an appropriate
choice in the switch 86. A lot of statistical analysis of sound has to be
performed and included in the signal analysing unit 84. Any small change in
the encoding schemes may turn the statistical behaviour upside down.
By using closed loop selection (Fig. 3b), encoding schemes may be exchanged
without making any changes in the rest of the unit. On the other hand, if
many encoding schemes are to be investigated, the computational
requirements will be high.
The benefit of such a variable frame length coding for the side signal is
that one can select between a fine temporal resolution and coarse frequency
resolution on one side and coarse temporal resolution and fine frequency
resolution on the other. The above embodiments will preserve the stereo
image in the best possible manner.
There are also some requirements on the actual encoding utilised in the
different encoding schemes. In particular when the closed loop selection is
used, the computational resources to perform a number of more or less
simultaneous encodings have to be large. The more complicated the encoding
process is, the more computational power is needed. Furthermore, a low bit
rate at transmission is also preferable.
The method presented in US 5,434,948 uses a filtered version of the mono
(main) signal to resemble the side or difference signal. The filter parameters
are optimised and allowed to vary in time. The filter parameters are then
transmitted, representing an encoding of the side signal. In one embodiment,
a residual side signal is also transmitted. In many cases, such an approach
could be used as a side signal encoding method within the scope of
the present invention. This approach has, however, some disadvantages. The
quantisation of the filter coefficients and any residual side signal often
requires relatively high bit rates for transmission, since the filter order has
to be high to provide an accurate side signal estimate. The estimation of the
filter itself may be problematic, especially in cases of transient-rich music.
Estimation errors will give a modified side signal that is sometimes larger in
magnitude than the unmodified signal. This will lead to higher bit rate
demands. Moreover, if a new set of filter coefficients is computed every N
samples, the filter coefficients need to be interpolated to yield a smooth
transition from one set of filter coefficients to another, as discussed above.
Interpolation of filter coefficients is a complex task and errors in the
interpolation will manifest themselves in large side error signals, leading to
higher bit rates needed for the difference error signal encoder.
A means to avoid the need for interpolation is to update the filter coefficients
on a sample-by-sample basis and rely on backwards-adaptive analysis. For
this to work well, the bit rate of the residual encoder needs to be fairly
high. This is therefore not a good alternative for low bit rate stereo coding.
There exist cases, quite common with e.g. music, where the mono and the
difference signals are almost uncorrelated. The filter estimation then
becomes very troublesome, with the added risk of just making things worse
for the difference error signal encoder.
The solution according to US 5,434,948 can work pretty well in cases where
the filter coefficients vary very slowly in time, e.g. conference telephony
systems. In the case of music signals, this approach does not work very well,
as the filters need to change very fast to track the stereo image. This means
that sub-frame lengths of very differing magnitude have to be utilised, which
means that the number of combinations to test increases rapidly. This in
turn means that the requirements for computing all possible encoding
schemes become impracticably high.
Therefore, in a preferred embodiment, the encoding of the side signal is
based on the idea of reducing the redundancy between the mono and side
signal by using a simple balance factor instead of a complex, bit rate
consuming predictor filter. The residual of this operation is then encoded.
The magnitude of such a residual is relatively small and does not call for
a very high bit rate for transfer. This idea is very suitable indeed to
combine with the variable frame set approach described earlier, since the
computational complexity is low.
The use of a balance factor combined with the variable frame length
approach removes the need for complex interpolation and the associated
problems that interpolation may cause. Moreover, the use of a simple
balance factor instead of a complex filter gives fewer problems with
estimation, as possible estimation errors for the balance factor have less
impact. The preferred solution will be able to reproduce both panned signals
and diffuse sound fields with good quality and with limited bit rate
requirements and computational resources.
Fig. 4 illustrates a preferred embodiment of a stereo encoder according to the
present invention. This embodiment is very similar to the one shown in Fig.
2a, however, with the details of the side signal encoder unit 30 revealed. The
encoder 14 of this embodiment does not have any pre-processing unit, and
the input signals are provided directly to the addition and subtraction units
34, 36. The mono signal xmono is multiplied with a certain balance factor gsm
in a multiplier 33. In a subtraction unit 35, the multiplied mono signal is
subtracted from the side signal xside, i.e. essentially the difference between
the two channels, to produce a side residual signal. The balance factor gsm is
determined based on the content of the mono and side signals by the
optimiser 37 in order to minimise the side residual signal according to a
quality criterion. The quality criterion is preferably a least mean square
criterion. The side residual signal is encoded in a side residual encoder 39
according to any encoder procedure. Preferably, the side residual encoder
39 is a low bit rate transform encoder or a CELP (Codebook Excited Linear
Prediction) encoder. The encoding parameters pside representing the side
signal then comprise the encoding parameters pside residual representing the
side residual signal and the optimised balance factor 49.
In the embodiment of Fig. 4, the mono signal 42 used for synthesising the
side signals is the target signal xmono for the mono encoder 38. As mentioned
above (in connection with Fig. 2a), the local synthesis signal of the mono
encoder 38 can also be utilised. In the latter case, the total encoder delay
may be increased and the computational complexity for the side signal may
increase. On the other hand, the quality may be better as it is then possible
to repair coding errors made in the mono encoder.
In a more mathematical way, the basic encoding scheme can be described as
follows. Denote the two channel signals as a and b, which may be the left
and right channel of a stereo pair. The channel signals are combined into a
mono signal by addition and into a side signal by subtraction. In equation
form, the operations are described as:
$$ x_{\mathrm{mono}}(n) = 0.5\,\bigl(a(n) + b(n)\bigr) $$
$$ x_{\mathrm{side}}(n) = 0.5\,\bigl(a(n) - b(n)\bigr) $$
It is beneficial to scale the xmono and xside signals down by a factor of two.
It is here implied that other ways of creating xmono and xside exist. One can
for instance use:
$$ x_{\mathrm{mono}}(n) = \gamma\,a(n) + (1-\gamma)\,b(n) $$
$$ x_{\mathrm{side}}(n) = \gamma\,a(n) - (1-\gamma)\,b(n) $$
$$ 0 \le \gamma \le 1.0 . $$
On blocks of the input signals, a modified or residual side signal is
computed according to:
$$ x_{\mathrm{side\,residual}}(n) = x_{\mathrm{side}}(n) - f(x_{\mathrm{mono}}, x_{\mathrm{side}})\,x_{\mathrm{mono}}(n), $$
where f(xmono, xside) is a balance factor function that, based on a block of N
samples, i.e. a sub-frame, from the side and mono signals, strives to remove
as much as possible from the side signal. In other words, the balance factor
is used to minimise the residual side signal. In the special case where it is
minimised in a mean square sense, this is equivalent to minimising the
energy of the residual side signal xside residual.
In the above-mentioned special case, f(xmono, xside) is described as:
$$ f(x_{\mathrm{mono}}, x_{\mathrm{side}}) = \frac{R_{sm}}{R_{mm}} $$
$$ R_{mm} = \sum_{n=\text{frame start}}^{\text{frame end}} x_{\mathrm{mono}}(n)\,x_{\mathrm{mono}}(n) $$
$$ R_{sm} = \sum_{n=\text{frame start}}^{\text{frame end}} x_{\mathrm{side}}(n)\,x_{\mathrm{mono}}(n) $$
where xside is the side signal and xmono is the mono signal. Note that the
function is based on a block starting at "frame start" and ending at "frame
end".
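The mean square balance factor described above can be illustrated with a short sketch. The function names are hypothetical, the signals are plain lists holding one sub-frame, and returning 0.0 for a silent mono block is an assumption that the description does not address.

```python
def downmix(a, b, gamma=0.5):
    """Form the mono and side signals from the channel signals a and b."""
    mono = [gamma * x + (1.0 - gamma) * y for x, y in zip(a, b)]
    side = [gamma * x - (1.0 - gamma) * y for x, y in zip(a, b)]
    return mono, side

def balance_factor(mono, side):
    """Least mean square balance factor g = R_sm / R_mm over one sub-frame."""
    r_mm = sum(m * m for m in mono)
    r_sm = sum(s * m for s, m in zip(side, mono))
    if r_mm == 0.0:
        return 0.0  # assumption: no prediction when the mono block is silent
    return r_sm / r_mm

def side_residual(mono, side, g):
    """Residual side signal left after removing the scaled mono signal."""
    return [s - g * m for s, m in zip(side, mono)]

# A source panned towards channel a: a = 0.75 * src, b = 0.25 * src.
src = [1.0, -2.0, 3.0, 0.5]
a = [0.75 * x for x in src]
b = [0.25 * x for x in src]
mono, side = downmix(a, b)
g = balance_factor(mono, side)
residual = side_residual(mono, side, g)
```

For a purely panned source such as this one, the side signal is a scaled copy of the mono signal, so the balance factor alone captures it and the residual vanishes.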
It is possible to add weighting in the frequency domain to the computation of
the balance factor. This is done by convolving the xside and xmono signals
with the impulse response of a weighting filter. It is then possible to move
the estimation errors to a frequency range where they are less audible.
This is referred to as perceptual weighting.
A quantized version of the balance factor value given by the function
f(xmono, xside) is transmitted to the decoder. It is preferable to account for
the quantization already when the modified side signal is generated. The
expression below is then obtained:
$$ x_{\mathrm{side\,residual}}(n) = x_{\mathrm{side}}(n) - g_Q\,x_{\mathrm{mono}}(n) $$
$$ g_Q = Q_g^{-1}\!\left(Q_g\!\left(\frac{R_{sm}}{R_{mm}}\right)\right) $$
Qg(.) is a quantization function that is applied to the balance factor given
by the function f(xmono, xside). The balance factor is transmitted on the
transmission channel. In normal left-right panned signals the balance factor
is limited to the interval [-1.0, 1.0]. If on the other hand the channels are
out of phase with regard to one another, the balance factor may extend beyond
these limits.
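Accounting for the quantization when the residual is formed may be sketched as follows. The quantizer is an assumption: a uniform scalar quantizer with a hypothetical step of 0.125 and a hypothetical clamping range of [-2.0, 2.0], chosen wider than [-1.0, 1.0] to leave room for out-of-phase channels. The description does not specify the form of Qg.

```python
def quantise_g(g, step=0.125, g_max=2.0):
    """Q_g: map a balance factor to an integer index (uniform quantizer, assumed)."""
    clamped = max(-g_max, min(g_max, g))
    return int(round(clamped / step))

def dequantise_g(index, step=0.125):
    """Inverse of Q_g: map the transmitted index back to a balance factor."""
    return index * step

def quantised_side_residual(mono, side, step=0.125):
    """Compute g_Q from one sub-frame and form the residual with g_Q itself."""
    r_mm = sum(m * m for m in mono)
    r_sm = sum(s * m for s, m in zip(side, mono))
    g = r_sm / r_mm if r_mm > 0.0 else 0.0  # silent-block fallback is an assumption
    index = quantise_g(g, step)
    g_q = dequantise_g(index, step)
    residual = [s - g_q * m for s, m in zip(side, mono)]
    return index, g_q, residual
```

Because the residual is computed with the same g_Q that the decoder will reconstruct from the index, the quantization error of the balance factor ends up in the residual, where the residual encoder can represent it, rather than appearing as a mismatch at the decoder.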
As an optional means to stabilise the stereo image, one can limit the balance
factor if the normalised cross correlation between the mono and the side
signal is poor, as given by the equation below:
$$ g_Q = Q_g^{-1}\!\left(Q_g\!\left(\left|\bar{R}_{sm}\right| \frac{R_{sm}}{R_{mm}}\right)\right) $$
where
$$ \bar{R}_{sm} = \frac{R_{sm}}{\sqrt{R_{ss}\,R_{mm}}} $$
$$ R_{ss} = \sum_{n=\text{frame start}}^{\text{frame end}} x_{\mathrm{side}}(n)\,x_{\mathrm{side}}(n) $$
These situations occur quite frequently with e.g. classical music or studio
music with a great amount of diffuse sounds, where in some cases the a and
b channels might almost cancel out one another on occasions when a mono
signal is created. The effect on the balance factor is that it can jump rapidly,
causing a confused stereo image. The fix above alleviates this problem.
The filter-based approach in US 5,434,948 has similar problems, but in
that case the solution is not so simple.
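The optional limiting of the balance factor can be sketched as below, before quantization. The sketch assumes that the normalised cross correlation is R_sm divided by the square root of the product of the side energy R_ss and the mono energy R_mm, and that the balance factor is scaled by its magnitude. The function name and the zero-energy fallback are hypothetical.

```python
import math

def limited_balance_factor(mono, side):
    """Balance factor scaled by |normalised cross correlation| (unquantized)."""
    r_mm = sum(m * m for m in mono)
    r_ss = sum(s * s for s in side)
    r_sm = sum(s * m for s, m in zip(side, mono))
    if r_mm == 0.0 or r_ss == 0.0:
        return 0.0  # assumption: no prediction when either block is silent
    norm_corr = r_sm / math.sqrt(r_ss * r_mm)  # lies in [-1, 1]
    return abs(norm_corr) * r_sm / r_mm
```

When mono and side are well correlated, the magnitude of the normalised cross correlation is close to 1 and the factor is left almost unchanged. When the correlation is poor, the factor is pulled towards zero, which damps the rapid jumps described above.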
If Es is the encoding function (e.g. a transform encoder) of the residual side
signal and Em is the encoding function of the mono signal, then the decoded
a" and b" signals in the decoder end can be described as (it is assumed here
that γ = 0.5):
$$ a''(n) = (1 + g_Q)\,x''_{\mathrm{mono}}(n) + x''_{\mathrm{side}}(n) $$
$$ b''(n) = (1 - g_Q)\,x''_{\mathrm{mono}}(n) - x''_{\mathrm{side}}(n) $$
$$ x''_{\mathrm{side}} = E_s^{-1}\bigl(E_s(x_{\mathrm{side\,residual}})\bigr) $$
$$ x''_{\mathrm{mono}} = E_m^{-1}\bigl(E_m(x_{\mathrm{mono}})\bigr) $$
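The decoder-side reconstruction for γ = 0.5 can be sketched as follows. The function name is hypothetical, and the round trip below assumes lossless mono and residual codecs purely to check the algebra.

```python
def decode_stereo(mono_dec, side_residual_dec, g_q):
    """Rebuild the two channel signals from decoded mono, decoded residual and g_Q."""
    a = [(1.0 + g_q) * m + r for m, r in zip(mono_dec, side_residual_dec)]
    b = [(1.0 - g_q) * m - r for m, r in zip(mono_dec, side_residual_dec)]
    return a, b

# Round trip with identity codecs (an assumption made only for this check).
a_in = [1.0, 2.0, -1.0]
b_in = [0.5, -1.0, 3.0]
mono = [0.5 * (x + y) for x, y in zip(a_in, b_in)]
side = [0.5 * (x - y) for x, y in zip(a_in, b_in)]
g_q = 0.25  # any transmitted value works as long as both ends use the same one
residual = [s - g_q * m for s, m in zip(side, mono)]
a_out, b_out = decode_stereo(mono, residual, g_q)
```

The reconstruction is exact for any g_Q in this lossless setting, since the term g_Q times the mono signal removed at the encoder is added back at the decoder. The choice of g_Q therefore only affects how small the residual is, not whether the channels can be recovered.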
One important benefit from computing the balance factor for each frame is
that one avoids the use of interpolation. Instead, normally, as described
above, the frame processing is performed with overlapping frames.
The encoding principle using balance factors operates particularly well in the
case of music signals, where fast changes typically are needed to track the
stereo image.
Lately, multi-channel coding has become popular. One example is
5.1-channel surround sound in DVD movies. The channels are there
arranged as: front left, front centre, front right, rear left, rear right and
subwoofer. In Fig. 5, an embodiment of an encoder that encodes the three
front channels in such an arrangement, exploiting interchannel redundancies
according to the present invention, is shown.
Three channel signals L, C, R are provided on three inputs 16A-C, and the
mono signal xmono is created by a sum of all three signals. A centre signal
encoder unit 130 is added, which receives the centre signal xcentre. The mono
signal 42 is in this embodiment the encoded and decoded mono signal
x"mono, and is multiplied with a certain balance factor gQ in a multiplier 133.
In a subtraction unit 135, the multiplied mono signal is subtracted from the
centre signal xcentre to produce a centre residual signal. The balance factor
gQ is determined based on the content of the mono and centre signals by an
optimiser 137 in order to minimise the centre residual signal according to
the quality criterion. The centre residual signal is encoded in a centre
residual encoder 139 according to any encoder procedure. Preferably, the
centre residual encoder 139 is a low bit rate transform encoder or a CELP
encoder. The encoding parameters pcentre representing the centre signal then
comprise the encoding parameters pcentre residual representing the centre
residual signal and the optimised balance factor 149. The centre residual
signal and the scaled mono signal are added in an addition unit 235,
creating a modified centre signal 142 that is compensated for encoding errors.
The side signal xside, i.e. the difference between the left L and right R
channels, is provided to the side signal encoder unit 30 as in earlier
embodiments. However, here, the optimiser 37 also depends on the modified
centre signal 142 provided by the centre signal encoder unit 130. The side
residual signal will therefore be created as an optimum linear combination of
the mono signal 42, the modified centre signal 142 and the side signal in the
subtraction unit 35.
The variable frame length concept described above can be applied
on either of the side and centre signals, or on both.
Fig. 6 illustrates a decoder unit suitable for receiving encoded
audio signals from the encoder unit of Fig. 5. The received
signal 54 is divided into encoding parameters pmono representing
the main signal, encoding parameters pcentre representing the
centre signal and encoding parameters pside representing the side
signal. In the decoder 64, the encoding parameters pmono
representing the main signal are used to generate a main signal
x"mono. In the decoder 160, the encoding parameters pcentre
representing the centre signal are used to generate a centre
signal x"centre, based on the main signal x"mono. In the decoder 60, the
encoding parameters pside representing the side signal are
decoded, generating a side signal x"side, based on the main signal
x"mono and the centre signal x"centre.
The procedure can be mathematically expressed as follows:
The input signals xleft, xright and xcentre are combined to a mono
channel according to:
$$ x_{\mathrm{mono}}(n) = \alpha\,x_{\mathrm{left}}(n) + \beta\,x_{\mathrm{right}}(n) + \chi\,x_{\mathrm{centre}}(n) $$
α, β and χ are in the remaining section set to 1.0 for
simplicity, but they can be set to arbitrary values. The α, β
and χ values can be either constant or dependent on the signal
contents in order to emphasise one or two channels in order to
achieve an optimal quality.
The normalised cross correlation between the mono and the centre
signal is computed as:
$$ \bar{R}_{cm} = \frac{R_{cm}}{\sqrt{R_{cc}\,R_{mm}}} $$
where
$$ R_{cc} = \sum_{n=\text{frame start}}^{\text{frame end}} x_{\mathrm{centre}}(n)\,x_{\mathrm{centre}}(n) $$
$$ R_{mm} = \sum_{n=\text{frame start}}^{\text{frame end}} x_{\mathrm{mono}}(n)\,x_{\mathrm{mono}}(n) $$
$$ R_{cm} = \sum_{n=\text{frame start}}^{\text{frame end}} x_{\mathrm{centre}}(n)\,x_{\mathrm{mono}}(n) $$
xcentre is the centre signal and xmono is the mono signal. The mono signal
comes from the mono target signal but it is possible to use the local
synthesis of the mono encoder as well.
The centre residual signal to be encoded is:
$$ x_{\mathrm{centre\,residual}}(n) = x_{\mathrm{centre}}(n) - g_Q\,x_{\mathrm{mono}}(n) $$
$$ g_Q = Q_g^{-1}\!\left(Q_g\!\left(\frac{R_{cm}}{R_{mm}}\right)\right) $$
Qg () is a quantization function that is applied to the balance factor. The
balance factor is transmitted on the transmission channel.
If Ec is the encoding function (e.g. a transform encoder) of the centre
residual signal and Em is the encoding function of the mono signal, then the
decoded x"centre signal in the decoder end can be described as:
$$ x''_{\mathrm{centre}}(n) = g_Q\,x''_{\mathrm{mono}}(n) + x''_{\mathrm{centre\,residual}}(n) $$
$$ x''_{\mathrm{centre\,residual}} = E_c^{-1}\bigl(E_c(x_{\mathrm{centre\,residual}})\bigr) $$
$$ x''_{\mathrm{mono}} = E_m^{-1}\bigl(E_m(x_{\mathrm{mono}})\bigr) $$
The side residual signal to be encoded is:
$$ x_{\mathrm{side\,residual}}(n) = \bigl(x_{\mathrm{left}}(n) - x_{\mathrm{right}}(n)\bigr) - g_{Qsm}\,x_{\mathrm{mono}}(n) - g_{Qsc}\,x_{\mathrm{centre}}(n), $$
where gQsm and gQsc are quantized values of the parameters gsm and gsc that
minimise the expression:
$$ \sum_{n=\text{frame start}}^{\text{frame end}} \Bigl|\bigl(x_{\mathrm{left}}(n) - x_{\mathrm{right}}(n)\bigr) - g_{sm}\,x_{\mathrm{mono}}(n) - g_{sc}\,x_{\mathrm{centre}}(n)\Bigr|^{\eta} $$
η can for instance be equal to 2 for a least square minimisation of the error.
The gsm and gsc parameters can be quantized jointly or separately.
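If Es is the encoding function of the side residual signal, then the decoded
x"left and x"right channel signals are given as:
$$ x''_{\mathrm{left}}(n) = x''_{\mathrm{mono}}(n) - x''_{\mathrm{centre}}(n) + x''_{\mathrm{side}}(n) $$
$$ x''_{\mathrm{right}}(n) = x''_{\mathrm{mono}}(n) - x''_{\mathrm{centre}}(n) - x''_{\mathrm{side}}(n) $$
$$ x''_{\mathrm{side}}(n) = x''_{\mathrm{side\,residual}}(n) + g_{Qsm}\,x''_{\mathrm{mono}}(n) + g_{Qsc}\,x''_{\mathrm{centre}}(n) $$
$$ x''_{\mathrm{side\,residual}} = E_s^{-1}\bigl(E_s(x_{\mathrm{side\,residual}})\bigr) $$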
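For the least square case (η = 2), the two parameters gsm and gsc can be found by solving a 2x2 set of normal equations. The following is a minimal sketch under that assumption. The function name is hypothetical, the parameters are left unquantized, and the fallback to a single mono-based factor when the mono and centre blocks are linearly dependent is an assumption not taken from the description.

```python
def joint_balance_factors(diff, mono, centre):
    """Least square g_sm, g_sc minimising sum((diff - g_sm*mono - g_sc*centre)^2).

    `diff` is x_left - x_right over one sub-frame.
    """
    r_mm = sum(m * m for m in mono)
    r_cc = sum(c * c for c in centre)
    r_mc = sum(m * c for m, c in zip(mono, centre))
    r_dm = sum(d * m for d, m in zip(diff, mono))
    r_dc = sum(d * c for d, c in zip(diff, centre))
    det = r_mm * r_cc - r_mc * r_mc
    if abs(det) < 1e-12:
        # Assumed fallback: predict from the mono signal only.
        return (r_dm / r_mm if r_mm > 0.0 else 0.0), 0.0
    # Cramer's rule on [[r_mm, r_mc], [r_mc, r_cc]] [g_sm, g_sc]^T = [r_dm, r_dc]^T
    g_sm = (r_dm * r_cc - r_mc * r_dc) / det
    g_sc = (r_mm * r_dc - r_mc * r_dm) / det
    return g_sm, g_sc

def side_residual_3ch(diff, mono, centre, g_sm, g_sc):
    """Side residual after removing the mono and centre contributions."""
    return [d - g_sm * m - g_sc * c for d, m, c in zip(diff, mono, centre)]
```

In an encoder along the lines of Fig. 5, the two factors would subsequently be quantized, jointly or separately, and the residual formed with the quantized values.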
One of the perception artefacts that are most annoying is the pre-echo effect.
In Fig. 7a-b, diagrams illustrate such an artefact. Assume a signal
component having the time development as shown by curve 100. In the
beginning, starting from t0, the signal component is not present in the audio
sample. At a time t between t1 and t2, the signal component suddenly
appears. When the signal component is encoded, using a frame length of t2-t1,
the occurrence of the signal component will be "smeared out" over the
entire frame, as indicated in curve 101. If a decoding takes place of the curve
101, the signal component appears a time Δt before the intended appearance
of the signal component, and a "pre-echo" is perceived.
The pre-echoing artefacts become more accentuated if long encoding frames
are used. By using shorter frames, the artefact is somewhat suppressed.
Another way to deal with the pre-echoing problems described above is to
utilise the fact that the mono signal is available at both the encoder and
decoder end. This makes it possible to scale the side signal according to the
energy contour of the mono signal. In the decoder end, the inverse scaling is
performed and thus some of the pre-echo problems may be alleviated.
An energy contour of the mono signal is computed over the frame as:
$$ E_c(m) = \sum_{n=m-L}^{m+L} w(n)\,x_{\mathrm{mono}}^2(n), \quad \text{frame start} \le m \le \text{frame end}, $$
where w(n) is a windowing function. The simplest windowing function is a
rectangular window, but other window types such as a Hamming window
may be more desirable.
The side residual signal is then scaled as:
$$ \tilde{x}_{\mathrm{side\,residual}}(n) = \frac{x_{\mathrm{side\,residual}}(n)}{E_c(n)}, \quad \text{frame start} \le n \le \text{frame end} $$
In a more general form the equation above can be written as:
$$ \tilde{x}_{\mathrm{side\,residual}}(n) = \frac{x_{\mathrm{side\,residual}}(n)}{f(E_c(n))}, \quad \text{frame start} \le n \le \text{frame end}, $$
where f(.) is a monotonic continuous function. In the decoder, the energy
contour is computed on the decoded mono signal and is applied to the
decoded side signal as:
$$ x''_{\mathrm{side}}(n) = \tilde{x}''_{\mathrm{side}}(n)\,f(E_c(n)), \quad \text{frame start} \le n \le \text{frame end} $$
Since this energy contour scaling is in some sense an alternative to
the use of shorter frame lengths, this concept is particularly
well suited to be combined with the variable frame length
concept described further above. By having some encoding schemes
that apply energy contour scaling, some that do not and some
that apply energy contour scaling only during certain
sub-frames, a more flexible set of encoding schemes may be provided.
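The energy contour scaling may be sketched as follows. Several details are assumptions made for illustration: the contour is taken as a sliding sum of squared mono samples with a rectangular window truncated at the block edges, the monotonic continuous function f(.) is chosen as a square root with a small floor to avoid division by zero, and all names are hypothetical.

```python
import math

def energy_contour(mono, half_width):
    """Sliding energy of the mono signal over n = m - L ... m + L (rectangular window)."""
    last = len(mono) - 1
    contour = []
    for m in range(len(mono)):
        lo = max(0, m - half_width)
        hi = min(last, m + half_width)
        contour.append(sum(x * x for x in mono[lo:hi + 1]))
    return contour

def scale_fn(energy, floor=1e-9):
    """Assumed choice of f(.): monotonic, continuous and strictly positive."""
    return math.sqrt(energy + floor)

def scale_residual(residual, contour):
    """Encoder side: divide the side residual by f(E_c(n))."""
    return [r / scale_fn(e) for r, e in zip(residual, contour)]

def unscale_residual(scaled, contour):
    """Decoder side: multiply the decoded residual by f(E_c(n))."""
    return [s * scale_fn(e) for s, e in zip(scaled, contour)]
```

In a real codec the decoder would compute the contour on the decoded mono signal, so the encoder would presumably need to use the same decoded signal for the two contours to match. This sketch uses a single contour for both directions.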
In Fig. 8, an embodiment of a side signal encoder unit 30 according to
the present invention is illustrated. Here, the different
encoding schemes 81 comprise hatched sub-frames, representing
encoding applying the energy contour scaling, and un-hatched
sub-frames, representing encoding procedures not applying the energy
contour scaling. In this manner, combinations not only of
sub-frames of differing lengths, but also of sub-frames of differing
encoding principles are available. In the present explanatory
example, the application of energy contour scaling differs
between different encoding schemes. In a more general case, any
encoding principles can be combined with the variable length
concept in an analogous manner.
The set of encoding schemes of Fig. 8 comprises schemes that
handle e.g. pre-echoing artefacts in different ways. In some
schemes, longer sub-frames with pre-echoing minimisation
according to the energy contour principle are used. In other
schemes, shorter sub-frames without energy contour scaling are
utilised. Depending on the signal content, one of the
alternatives may be more advantageous. For very severe
pre-echoing cases, encoding schemes utilising short sub-frames with
energy contour scaling may be necessary.
The proposed solution can be used in the full frequency band or
in one or more distinct sub-bands. The use of sub-bands can be
applied either to both the main and side signals, or to one of
them separately. A preferred embodiment comprises a split of the
side signal in several frequency bands. The reason is simply that
it is easier to remove the possible redundancy in
an isolated frequency band than in the entire frequency band. This is
particularly important when encoding music signals with rich spectral
content.
One possible use is to encode the frequency band below a pre-determined
threshold with the above method. The pre-determined threshold can
preferably be 2 kHz, or even more preferably 1 kHz. For the remaining part
of the frequency range of interest, one can either encode another additional
frequency band with the above method, or use a completely different
method.
One motivation to use the above method preferably for low frequencies is
that the diffuse sound fields generally have little energy content at high
frequencies. The natural reason is that sound absorption typically increases
with frequency. Also, the diffuse sound field components seem to play a less
important role for the human auditory system at higher frequencies.
Therefore, it is beneficial to employ this solution at low frequencies (below 1
or 2 kHz) and rely on other, even more bit efficient coding schemes at higher
frequencies. The fact that the scheme is only applied at low frequencies gives
a large saving in bit rate, as the necessary bit rate with the proposed method
is proportional to the required bandwidth. In most cases, the mono encoder
can encode the entire frequency band, while the proposed side signal
encoding is suggested to be performed only in the lower part of the frequency
band, as schematically illustrated by Fig. 9. Reference number 301 refers to
an encoding scheme according to the present invention of the side signal,
reference number 302 refers to any other encoding scheme of the side signal
and reference number 303 refers to an encoding scheme of the side signal.
There also exists the possibility to use the proposed method for several
distinct frequency bands.
In Fig. 10, the main steps of an embodiment of an encoding method
according to the present invention are illustrated as a flow diagram. The
procedure starts in step 200. In step 210, a main signal deduced from the
polyphonic signals is encoded. In step 212, encoding schemes are provided,
which comprise sub-frames with differing lengths and/or order. A side signal
deduced in step 214 from the polyphonic signals is encoded by an encoding
scheme selected dependent at least partly on the actual signal content of the
present polyphonic signals. The procedure ends in step 299.
In Fig. 11, the main steps of an embodiment of a decoding method according
to the present invention are illustrated as a flow diagram. The procedure
starts in step 200. In step 220, a received encoded main signal is decoded. In
step 222, encoding schemes are provided, which comprise sub-frames with
differing lengths and/or order. A received side signal is decoded in step 224
by a selected encoding scheme. In step 226, the decoded main and side
signals are combined to a polyphonic signal. The procedure ends in step 299.
The embodiments described above are to be understood as a few illustrative
examples of the present invention. It will be understood by those skilled in the
art that various modifications, combinations and changes may be made to the
embodiments without departing from the scope of the present invention. In
particular, different part solutions in the different embodiments can be
combined in other configurations, where technically possible. The scope of the
present invention is, however, defined by the appended claims.
REFERENCES
European patent 0497413
US patent 5,285,498
US patent 5,434,948
C. Faller et al., "Binaural cue coding applied to stereo and multi-channel audio
compression", 112th AES Convention, May 2002, Munich, Germany.