Patent 2644753 Summary

(12) Patent Application: (11) CA 2644753
(54) English Title: SYSTEM AND METHOD FOR PROVIDING ERROR RESILIENCE, RANDOM ACCESS AND RATE CONTROL IN SCALABLE VIDEO COMMUNICATIONS
(54) French Title: SYSTEME ET PROCEDE PERMETTANT DE FOURNIR LA ROBUSTESSE AUX ERREURS, L'ACCES DIRECT ET LA COMMANDE DE DEBIT DANS DES COMMUNICATIONS VIDEO ECHELONNABLES
Status: Dead
Bibliographic Data
(51) International Patent Classification (IPC):
  • H04B 1/66 (2006.01)
(72) Inventors :
  • ELEFTHERIADIS, ALEXANDROS (United States of America)
  • HONG, DANNY (United States of America)
  • SHAPIRO, OFER (United States of America)
  • WIEGAND, THOMAS (Germany)
(73) Owners :
  • VIDYO, INC. (United States of America)
(71) Applicants :
  • VIDYO, INC. (United States of America)
(74) Agent: BERESKIN & PARR LLP/S.E.N.C.R.L.,S.R.L.
(74) Associate agent:
(45) Issued:
(86) PCT Filing Date: 2007-03-05
(87) Open to Public Inspection: 2007-09-13
Examination requested: 2009-02-24
Availability of licence: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): Yes
(86) PCT Filing Number: PCT/US2007/063335
(87) International Publication Number: WO2007/103889
(85) National Entry: 2008-09-03

(30) Application Priority Data:
Application No.      Country/Territory           Date
60/778,760           United States of America    2006-03-03
PCT/US2006/061815    United States of America    2006-12-08
PCT/US2006/062569    United States of America    2006-12-22
60/884,148           United States of America    2007-01-09
PCT/US2007/062357    United States of America    2007-02-16
60/787,031           United States of America    2006-03-29
60/786,997           United States of America    2006-03-29
PCT/US2006/028365    United States of America    2006-07-20
PCT/US2006/028366    United States of America    2006-07-20
PCT/US2006/028367    United States of America    2006-07-20
PCT/US2006/028368    United States of America    2006-07-20
60/829,609           United States of America    2006-10-16
60/862,510           United States of America    2006-10-23

Abstracts

English Abstract

Systems and methods for error resilient transmission, rate control, and random access in video communication systems that use scalable video coding are provided. Error resilience is obtained by using information from low resolution layers to conceal or compensate loss of high resolution layer information. The same mechanism is used for rate control by selectively eliminating high resolution layer information from transmitted signals, which elimination can be compensated at the receiver using information from low resolution layers. Further, random access or switching between low and high resolutions is also achieved by using information from low resolution layers to compensate for high resolution spatial layer packets that may have not been received prior to the switching time.


French Abstract

Systèmes et procédés de transmission à robustesse aux erreurs, de commande de débit et d'accès direct dans des systèmes de communication vidéo reposant sur l'utilisation du codage vidéo échelonnable. La robustesse aux erreurs est obtenue par l'utilisation d'informations provenant de couches à faible résolution pour cacher ou compenser la perte d'informations de couches à haute résolution. Le même mécanisme est utilisé pour la commande de débit, par élimination sélective d'informations de couches à haute résolution dans des signaux transmis, cette élimination pouvant être compensée dans le récepteur à l'aide d'informations provenant de couches à faible résolution. En outre, l'accès direct ou la commutation entre la faible résolution et la haute résolution sont également obtenus par l'utilisation d'informations provenant de couches à faible résolution pour compenser les paquets de couche spatiale à haute résolution qui n'ont peut-être pas été reçus avant le moment de la commutation.

Claims

Note: Claims are shown in the official language in which they were submitted.




1. A digital video decoding system, the system comprising:

a decoder that is capable of decoding a received digital video signal,
which is coded in a scalable video coding format supporting temporal
scalability and at least one of spatial and quality scalability,

wherein the scalable video coding format for spatial scalability includes a
base spatial and at least one spatial enhancement layer, for quality
scalability includes a base quality layer and at least one quality
enhancement layer, and for temporal scalability includes a base temporal
layer and at least one temporal enhancement layer, wherein the base temporal
layers and enhancement temporal layers are interlinked by a threaded picture
prediction structure for at least one of the spatial or quality scalability
layers,

and wherein, for decoding a picture at a target spatial or quality layer
higher than the corresponding base layer, the decoder is configured to use
coded information from a layer lower than the target layer when a portion of
the target layer's coded information is lost or not available.
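
Editorial illustration (not part of the claims): the fallback recited in claim 1 can be pictured as a lookup that prefers the target layer and otherwise walks down to the highest lower layer that was received. The sketch below is a minimal model under stated assumptions; the names LayerData and select_decoding_source, and the use of a string payload to stand in for motion vectors, residuals and intra data, are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LayerData:
    """Coded information for one picture at one spatial/quality layer (hypothetical model)."""
    layer: int    # 0 = base layer, higher numbers = enhancement layers
    payload: str  # stands in for motion vectors, residuals, intra data


def select_decoding_source(picture: Dict[int, Optional[LayerData]],
                           target_layer: int) -> LayerData:
    """Return the target layer's data if present, else the highest lower layer that arrived."""
    for layer in range(target_layer, -1, -1):
        data = picture.get(layer)
        if data is not None:
            return data
    raise LookupError("no layer at or below the target was received")


# Usage: the enhancement layer (1) was lost, so the base layer (0) is used instead.
received = {0: LayerData(0, "base"), 1: None}
source = select_decoding_source(received, target_layer=1)
print(source.layer)
```

In an actual decoder the lower-layer data would still have to be adapted to the target resolution before use; this sketch only shows the selection step.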

2. The system of claim 1, wherein the digital video decoding system is
disposed in a receiving endpoint, the system further comprising:

a linking communication network;

a conferencing server linked to the receiving endpoint and at least one
transmitting endpoint by at least one communication channel each over the
communication network, and

at least one endpoint that transmits the coded digital video that is coded
in the scalable video coding format,

wherein the conferencing server is configured to selectively eliminate
portions of the
input video signals received from transmitting endpoints that correspond to
layers
higher than the base spatial or quality layer, prior to creating the output
video signal
that is forwarded to the receiving endpoint.
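
Editorial illustration (not part of the claims): one way to picture the selective elimination performed by the conferencing server is a per-picture bit budget, which is the criterion claim 6 refers to as desired output bit rate requirements. The sketch below assumes a hypothetical Packet record and a hypothetical function forward_within_budget; it always keeps base-layer packets and adds enhancement layers from lowest to highest until the budget is exhausted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Packet:
    layer: int      # 0 = base spatial/quality layer (assumed numbering)
    size_bits: int


def forward_within_budget(packets: List[Packet], budget_bits: int) -> List[Packet]:
    """Keep all base-layer packets; keep enhancement packets, lowest layer first, while they fit."""
    kept = [p for p in packets if p.layer == 0]
    used = sum(p.size_bits for p in kept)
    enhancement = sorted((p for p in packets if p.layer > 0), key=lambda p: p.layer)
    for p in enhancement:
        if used + p.size_bits > budget_bits:
            break  # stop at the first layer that does not fit; higher layers are dropped too
        kept.append(p)
        used += p.size_bits
    return kept


# Usage: with a 350-bit budget, layers 0 and 1 are forwarded and layer 2 is eliminated.
picture = [Packet(0, 100), Packet(1, 200), Packet(2, 400)]
print([p.layer for p in forward_within_budget(picture, 350)])
```

The base layer is forwarded even when it alone exceeds the budget, which is a design choice of this sketch rather than something the claim states.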

3. The system of claim 2 wherein the conferencing server linked to the
receiving endpoint and at least one transmitting endpoint is one of:

a Transcoding Multipoint Control Unit using cascaded decoding and
encoding;

a Switching Multipoint Control Unit by selecting which input to
transmit as output;

a Scalable Video Communication Server using selective multiplexing;
and

a Compositing Scalable Video Communication Server using selective
multiplexing and bitstream-level compositing.

4. The system of claim 2 wherein an encoder of the at least one
transmitting endpoint is configured to encode transmitted media as frames in a

threaded coding structure having a number of different temporal levels,
wherein a
subset of the frames ("R") is particularly selected for reliable transport and
includes at
least the frames of the lowest temporal layer in the threaded coding structure
so that
the decoder can decode at least a portion of received media based on a
reliably
received frame of the type R after packet loss or error and thereafter is
synchronized
with the encoder, and wherein the server selectively eliminates portions of
the input
video signals received from transmitting endpoints that correspond to layers
higher
than the base spatial or quality layer in non-R frames only, prior to creating
the output
video signal that is forwarded to the receiving endpoint.
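
Editorial illustration (not part of the claims): the sketch below models a threaded coding structure with a dyadic layout of temporal levels and treats the frames of the lowest temporal level as the "R" frames. Both the dyadic layout and the choice that R contains exactly the lowest-level frames are assumptions made for the example (the claim only requires that R include at least those frames). The function names are hypothetical.

```python
from typing import List, Tuple


def temporal_layer(frame_index: int, num_levels: int = 3) -> int:
    """Temporal level of a frame in an assumed dyadic threaded structure (0 = lowest)."""
    period = 1 << (num_levels - 1)
    pos = frame_index % period
    if pos == 0:
        return 0
    level = num_levels - 1
    while pos % 2 == 0:  # each factor of two moves the frame one level lower
        pos //= 2
        level -= 1
    return level


def is_r_frame(frame_index: int, num_levels: int = 3) -> bool:
    """Assumption: R frames are exactly the frames of the lowest temporal level."""
    return temporal_layer(frame_index, num_levels) == 0


def thin_non_r_frames(frames: List[Tuple[int, List[int]]],
                      num_levels: int = 3) -> List[Tuple[int, List[int]]]:
    """Drop spatial/quality layers above the base (layer 0) in non-R frames only."""
    out = []
    for frame_index, layers in frames:
        if is_r_frame(frame_index, num_levels):
            out.append((frame_index, list(layers)))
        else:
            out.append((frame_index, [layer for layer in layers if layer == 0]))
    return out


# Usage: frames 0 and 4 are R frames and keep their enhancement layer.
stream = [(0, [0, 1]), (1, [0, 1]), (2, [0, 1]), (4, [0, 1])]
print(thin_non_r_frames(stream))
```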

5. The system of claim 2, wherein the conferencing server is further
configured to control the transmission rate of the output video signal that is
forwarded
to the at least one receiving endpoint so that the retained portions of the
input video
signals received from transmitting endpoints that correspond to layers higher
than the
base spatial or quality layer do not adversely affect the smoothness of the
output bit
rate.

6. The system of claim 2, wherein the selective elimination by the
conferencing server is performed according to desired output bit rate
requirements.

7. The system of claim 1, wherein the digital video decoding system is
disposed in a receiving endpoint, the system further comprising:

a transmitting endpoint that transmits coded digital video using a
scalable video coding format;

a communication network that links the transmitting endpoint with the
receiving endpoint,

wherein the transmitting endpoint is configured to selectively not transmit
portions of
its input video signal that correspond to layers higher than the base spatial
or quality
layer prior to creating the output video signal that is transmitted to the at
least one
receiving endpoint in order to achieve a desired output bit rate.

8. The system of claim 7 wherein the encoder of the transmitting
endpoint is configured to encode transmitted media as frames in a threaded
coding
structure having a number of different temporal levels, wherein a subset of
the frames
("R") is particularly selected for reliable transport and includes at least
the frames of
the lowest temporal layer in the threaded coding structure and such that the
decoder
can decode at least a portion of received media based on a reliably received
frame of
the type R after packet loss or error and thereafter is synchronized with the
encoder,
and wherein the encoder selectively does not transmit to the at least one
receiving
endpoint portions of its input video signal that correspond to layers higher
than the
base spatial or quality layer in non-R frames only.

9. The system of claim 7, wherein the transmitting endpoint is further
configured to control the transmission rate of the output video signal that is
forwarded
to the at least one receiving endpoint so that the retained portions of its
input video
signal that correspond to layers higher than the base spatial or quality
layer do not
adversely affect the smoothness of the output bit rate.

10. The system of claim 7, wherein the decision for selective transmission
by the transmitting endpoint is performed according to desired output bit rate
requirements.

11. The system of claim 1, wherein the decoder is configured to display
the decoded output picture at a desired spatial resolution that falls in
between an
immediately lower and an immediately higher spatial layer provided by the
coded
video signal.
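
Editorial illustration (not part of the claims): when the desired display resolution falls between two coded spatial layers, a receiver has to decide which layer to decode and how much to scale its output. The sketch below assumes the receiver decodes the immediately higher layer and scales down; the function name pick_layer_for_display and the use of picture widths to compare layers are hypothetical.

```python
from fractions import Fraction
from typing import Sequence, Tuple


def pick_layer_for_display(layer_widths: Sequence[int],
                           desired_width: int) -> Tuple[int, Fraction]:
    """Return (index of the immediately higher layer, scale factor to reach the desired width)."""
    widths = sorted(layer_widths)
    for index, width in enumerate(widths):
        if width >= desired_width:
            return index, Fraction(desired_width, width)
    raise ValueError("desired width exceeds the highest spatial layer")


# Usage: a 480-wide display between a 320-wide and a 640-wide layer.
layer_index, scale = pick_layer_for_display([320, 640], 480)
print(layer_index, scale)
```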

12. The system of claim 1, wherein the decoder is further configured to
operate the decoding loop of the immediately higher spatial layer at the
desired spatial
resolution by scaling all coded data of the immediately higher spatial layer
to the
desired spatial resolution, and wherein the resultant drift is eliminated by
using at
least one of:

periodic intra pictures;

periodic use of intra base layer mode; and

full resolution decoding of at least the lowest temporal layer of the
immediately higher spatial layer.

13. The system of claim 1, wherein the scalable video coding format is
further configured with at least one of:

periodic intra pictures,

periodic intra macroblocks, and

threaded picture prediction,

in order to avoid drift when the target layer's coded information that is
lost or is not available corresponds to the base temporal layer.

14. The system of claim 1, where the scalable video coding format is based
on hybrid coding such as in H.264, VC-1 or AVS standards, wherein the coded
information from a spatial or quality layer lower than the target layer used
by the
decoder when some or all of the target layer's coded information is lost or is
not
available comprises at least one of:

motion vector data, appropriately scaled for the target layer's
resolution;




coded prediction error difference, upsampled to the target layer's
resolution; and

intra data, upsampled to the target layer's resolution,

and wherein the decoder is further configured to use the target layer's
decoded
pictures as references in the decoding process in order to construct the
decoded output
picture, rather than the lower layer decoded reference pictures.
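
Editorial illustration (not part of the claims): claim 14 lists lower-layer data that is scaled or upsampled to the target layer's resolution. The sketch below shows the two operations in their simplest form, assuming an integer resolution ratio (for example 2 between a base and an enhancement layer) and nearest-neighbour replication as a stand-in for the interpolation filters a real codec would use. The function names are hypothetical.

```python
from typing import List, Tuple


def scale_motion_vector(mv: Tuple[int, int], factor: int) -> Tuple[int, int]:
    """Scale a lower-layer motion vector by an integer resolution ratio (assumed)."""
    return (mv[0] * factor, mv[1] * factor)


def upsample_nearest(block: List[List[int]], factor: int) -> List[List[int]]:
    """Upsample a block of residual or intra samples by sample replication."""
    out: List[List[int]] = []
    for row in block:
        wide = [value for value in row for _ in range(factor)]  # repeat each sample horizontally
        out.extend(list(wide) for _ in range(factor))           # repeat each row vertically
    return out


# Usage: adapt base-layer data for a target layer with twice the resolution.
print(scale_motion_vector((3, -2), 2))
print(upsample_nearest([[1, 2], [3, 4]], 2))
```

As the claim recites, the adapted data would then be applied against the target layer's own decoded reference pictures rather than the lower layer's references; that step is outside this sketch.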

15. The system of claim 1, wherein the decoder is further configured to
operate at least one decoding loop for spatial or quality layers higher than
the target
spatial or quality layer for at least the base temporal layer, so that when
the decoder
switches target layers it can immediately display decoded pictures at the new
target
layer resolution.

16. A video communication system comprising:

a communication network,

a conferencing server disposed in the network and linked to at least one
receiving and at least one transmitting endpoint by at least one communication
channel each over the communication network,

at least one endpoint that transmits coded digital video using a scalable
video coding format, and

at least one receiving endpoint that is capable of decoding a digital
video signal coded in a scalable video coding format supporting temporal
scalability
and at least one of spatial and quality scalability,

wherein the scalable video coding format for spatial scalability includes a
base spatial and at least one spatial enhancement layer, for quality
scalability includes a base quality layer and at least one quality
enhancement layer, and for temporal scalability includes a base temporal
layer and at least one temporal enhancement layer, wherein the base temporal
layers and enhancement temporal layers are interlinked by a threaded picture
prediction structure for at least one of the spatial or quality scalability
layers,

and wherein the conferencing server is configured to selectively eliminate or
modify
portions of the input video signals received from transmitting endpoints that
correspond to layers higher than the base spatial or quality layer, prior to
creating the
output video signal that is forwarded to the at least one receiving endpoint,
so that use
of lower spatial or quality layer data is signaled or explicitly coded in the
output video
signal for use in decoding pictures at resolutions higher than the base
spatial or quality
layer.

17. The system of claim 16, wherein the scalable video coding format is
based on hybrid coding such as in H.264, VC-1 or AVS standards, and wherein
the lower spatial or quality layer data that is signaled for use or
explicitly coded in the output video signal forwarded to the at least one
receiving endpoint is comprised of at least one of:

motion vector data,

coded prediction error difference,

intra data, and

reference picture indicators,

wherein the data is further appropriately scaled to the desired target
resolution when explicitly coded in the output video signal that is
transmitted to the one or more receiving endpoints.

18. The system of claim 16 wherein the server is further configured to
create the output video signal that is forwarded to the at least one receiving
endpoint
as one of:

a Transcoding Multipoint Control Unit using cascaded decoding and
encoding;

a Switching Multipoint Control Unit by selecting which input to
transmit as output;

a Scalable Video Communication Server using selective multiplexing;
and

a Compositing Scalable Video Communication Server using selective
multiplexing and bitstream-level compositing.

19. The system of claim 16 wherein an encoder of the at least one
transmitting endpoint is configured to encode transmitted media as frames in a

threaded coding structure having a number of different temporal levels,
wherein a
subset of the frames ("R") is particularly selected for reliable transport and
includes at
least the frames of the lowest temporal layer in the threaded coding structure
and such
that the decoder can decode at least a portion of received media based on a
reliably
received frame of the type R after packet loss or error and thereafter is
synchronized
with the encoder, and wherein the server selectively eliminates portions of
the input
video signals received from transmitting endpoints that correspond to layers
higher
than the base spatial or quality layer in non-R frames only, prior to creating
the output
video signal that is forwarded to the at least one receiving endpoint.

20. The system of claim 16 wherein the conferencing server is further
configured to control the transmission rate of the output video signal that is
forwarded
to the at least one receiving endpoint so that the retained portions of the
input video
signals received from transmitting endpoints that correspond to layers higher
than the
base spatial or quality layer do not adversely affect the smoothness of the
output bit
rate.

21. The system of claim 16, wherein the selective elimination or
modification by the conferencing server is performed according to desired
output bit
rate requirements.

22. The system of claim 16, wherein the at least one receiving endpoint is
configured to display the decoded output picture at a desired spatial
resolution that
falls in between an immediately lower and an immediately higher spatial layer
provided by the received coded video signal.

23. The system of claim 22, wherein the at least one receiving endpoint is
further configured to operate the decoding loop of the immediately higher
spatial
layer at the desired spatial resolution by scaling all coded data of the
immediately
higher spatial layer to the desired spatial resolution, and wherein the
resultant drift is
eliminated by using at least one of:

periodic intra pictures,

periodic use of intra base layer mode,

full resolution decoding of at least the lowest temporal layer of
the immediately higher spatial layer.

24. The system of claim 16, wherein the scalable video coding format is
further configured with at least one of:

periodic intra pictures;

periodic intra macroblocks; and

threaded picture prediction;

in order to avoid drift when the higher than the base spatial or
quality layer's coded information that is modified or eliminated corresponds
to the
base temporal layer.

25. The system of claim 16, wherein the receiving endpoint is further
configured to operate at least one decoding loop for spatial or quality layers
higher
than the target spatial or quality layer for at least the base temporal layer,
so that when
the at least one receiving endpoint switches target layers it can immediately
display
decoded pictures at the new target layer resolution.

26. A video communication system comprising:

a communication network,

one endpoint that transmits coded digital video using a scalable video
coding format, and

at least one receiving endpoint that is capable of decoding a digital
video signal coded in a scalable video coding format supporting temporal
scalability
and at least one of spatial and quality scalability,




wherein the scalable video coding format for spatial scalability includes a
base spatial and at least one spatial enhancement layer, for quality
scalability includes a base quality layer and at least one quality
enhancement layer, and for temporal scalability includes a base temporal
layer and at least one temporal enhancement layer, wherein the base temporal
layers and enhancement temporal layers are interlinked by a threaded picture
prediction structure for at least one of the spatial or quality scalability
layers, and

wherein the transmitting endpoint is configured to selectively eliminate or
modify
portions of its coded video signal that correspond to layers higher than the
base spatial
or quality layer, prior to creating the output video signal that is forwarded
to the at
least one receiving endpoint, so that use of lower spatial or quality layer
data is
signaled or explicitly coded in the output video signal for use in decoding
pictures at
resolutions higher than the base spatial or quality layer.

27. The system of claim 26, wherein the scalable video coding format is
based on hybrid coding such as in H.264, VC-1 or AVS standards, and wherein
the lower spatial or quality layer data that is signaled for use or
explicitly coded in the output video signal forwarded to the at least one
receiving endpoint is comprised of at least one of:

motion vector data;

coded prediction error difference;

intra data; and

reference picture indicators,

wherein the data is further appropriately scaled to the desired target
resolution when explicitly coded in the output video signal that is
transmitted to the one or more receiving endpoints.

28. The system of claim 26 wherein the transmitting endpoint is
configured to encode transmitted media as frames in a threaded coding
structure
having a number of different temporal levels, wherein a subset of the frames
("R") is
particularly selected for reliable transport and includes at least the frames
of the
lowest temporal layer in the threaded coding structure and such that the
decoder can
decode at least a portion of received media based on a reliably received frame
of the
type R after packet loss or error and thereafter is synchronized with the
encoder, and
wherein the transmitting endpoint selectively eliminates portions of the input
video
signals received from transmitting endpoints that correspond to layers higher
than the
base spatial or quality layer in non-R frames only, prior to creating the
output video
signal that is transmitted to the at least one receiving endpoint.

29. The system of claim 26, wherein the transmitting endpoint is further
configured to control the transmission rate of the output video signal that is

transmitted to the at least one receiving endpoint so that the retained
portions of its
input video signal that correspond to layers higher than the base spatial or
quality
layer do not adversely affect the smoothness of the output bit rate.

30. The system of claim 26, wherein the selective elimination or
modification by the transmitting endpoint is performed according to desired
output bit
rate requirements.

31. The system of claim 26, wherein the at least one receiving endpoint is
configured to display the decoded output picture at a desired spatial
resolution that
falls in between an immediately lower and an immediately higher spatial layer
provided by the received coded video signal.

32. The system of claim 26, wherein the at least one receiving endpoint is
further configured to operate the decoding loop of the immediately higher
spatial
layer at the desired spatial resolution by scaling all coded data of the
immediately
higher spatial layer to the desired spatial resolution, and wherein the
resultant drift is
eliminated by using at least one of:

periodic intra pictures,

periodic use of intra base layer mode,

full resolution decoding of at least the lowest temporal layer of the
immediately higher spatial layer.

33. The system of claim 26, wherein the scalable video coding format is
further configured with at least one of:

periodic intra pictures;

periodic intra macroblocks; and

threaded picture prediction,

in order to avoid drift when the higher than the base spatial or quality
layer's coded
information that is modified or eliminated corresponds to the base temporal
layer.

34. The system of claim 26, wherein the receiving endpoint is further
configured to operate at least one decoding loop for spatial or quality layers
higher
than the target spatial or quality layer for at least the base temporal layer,
so that when
the at least one receiving endpoint switches target layers it can immediately
display
decoded pictures at the new target layer resolution.

35. A method for decoding a digital video signal, the digital video signal
coded in a scalable video coding format supporting temporal scalability and at
least
one of spatial and quality scalability,

wherein the scalable video coding format for spatial scalability includes a
base spatial
and at least one spatial enhancement layer, for quality scalability includes a
base
quality layer and at least one quality enhancement layer, and for temporal
scalability
includes a base temporal layer and at least one temporal enhancement layer,
wherein
the base temporal layers and enhancement temporal layers are interlinked by a
threaded picture prediction structure for at least one of the spatial or
quality scalability
layers,

the method comprising:

receiving the digital video signal at a decoder; and

for decoding a picture at a target spatial or quality layer higher than the
corresponding base layer, using coded information from a spatial or quality
layer
lower than the target layer in the threaded prediction structure when a
portion of the
target layer's coded information is lost or not available.

36. The method of claim 35, wherein the decoder is disposed in a receiving
endpoint in a linking communication network,

wherein a conferencing server is linked to the receiving endpoint and at least
one
transmitting endpoint by at least one communication channel each over the
communication network, and

wherein the at least one transmitting endpoint transmits the coded digital
video that is
coded in the scalable video coding format,

the method further comprising, at the conferencing server, selectively
eliminating portions of the input video signals received from transmitting
endpoints
that correspond to layers higher than the base spatial or quality layer, prior
to creating
the output video signal that is forwarded to the receiving endpoint.

37. The method of claim 36 wherein the conferencing server linked to the
receiving endpoint and at least one transmitting endpoint is one of:

a Transcoding Multipoint Control Unit using cascaded decoding and
encoding;

a Switching Multipoint Control Unit by selecting which input to
transmit as output;

a Scalable Video Communication Server using selective multiplexing;
and

a Compositing Scalable Video Communication Server using selective
multiplexing and bitstream-level compositing.

38. The method of claim 36, further comprising, at an encoder of the at
least one transmitting endpoint, encoding transmitted media as frames in a
threaded
coding structure having a number of different temporal levels, wherein a
subset of the
frames ("R") is particularly selected for reliable transport and includes at
least the
frames of the lowest temporal layer in the threaded coding structure so that
the
decoder can decode at least a portion of received media based on a reliably
received
frame of the type R after packet loss or error and thereafter is synchronized
with the
encoder, and wherein the server selectively eliminates portions of the input
video
signals received from transmitting endpoints that correspond to layers higher
than the
base spatial or quality layer in non-R frames only, prior to creating the
output video
signal that is forwarded to the receiving endpoint.

39. The method of claim 36, further comprising, at the conferencing server
controlling the transmission rate of the output video signal that is forwarded
to the at
least one receiving endpoint so that the retained portions of the input video
signals
received from transmitting endpoints that correspond to layers higher than the
base
spatial or quality layer do not adversely affect the smoothness of the output
bit rate.

40. The method of claim 36, wherein the selective elimination by the
conferencing server is performed according to desired output bit rate
requirements.

41. The method of claim 35,

wherein a transmitting endpoint transmits coded digital video using a scalable
video
coding format;

wherein a communication network links the transmitting endpoint with the
receiving
endpoint,

the method further comprising, at the transmitting endpoint, selectively not
transmitting portions of its input video signal that correspond to layers
higher than the
base spatial or quality layer, prior to creating the output video signal that
is
transmitted to the at least one receiving endpoint in order to achieve a
desired output
bit rate.

42. The method of claim 41, further comprising, at the transmitting
endpoint encoding transmitted media as frames in a threaded coding structure
having
a number of different temporal levels, wherein a subset of the frames ("R") is

particularly selected for reliable transport and includes at least the frames
of the
lowest temporal layer in the threaded coding structure and such that the
decoder can
decode at least a portion of received media based on a reliably received frame
of the
type R after packet loss or error and thereafter is synchronized with the
encoder, and
wherein the encoder selectively does not transmit to the at least one
receiving
endpoint portions of its input video signal that correspond to layers higher
than the
base spatial or quality layer in non-R frames only.

43. The method of claim 41, further comprising, at the transmitting
endpoint controlling the transmission rate of the output video signal that is
forwarded
to the at least one receiving endpoint so that the retained portions of its
input video
signal that correspond to layers higher than the base spatial or quality
layer do not
adversely affect the smoothness of the output bit rate.

44. The method of claim 41, wherein the decision for selective
transmission by the transmitting endpoint is performed according to desired
output bit
rate requirements.

45. The method of claim 35, further comprising, at the decoder, displaying
the decoded output picture at a desired spatial resolution that falls in
between an
immediately lower and an immediately higher spatial layer provided by the
coded
video signal.

46. The method of claim 35, further comprising, at the decoder, operating
the decoding loop of the immediately higher spatial layer at the desired
spatial
resolution by scaling all coded data of the immediately higher spatial layer
to the
desired spatial resolution, and wherein the resultant drift is eliminated by
using at
least one of:

periodic intra pictures;

periodic use of intra base layer mode; and

full resolution decoding of at least the lowest temporal layer of the
immediately higher spatial layer.

47. The method of claim 35, wherein the scalable video coding format is
further configured with at least one of:

periodic intra pictures,

periodic intra macroblocks, and

threaded picture prediction,

in order to avoid drift when the target layer's coded information that is
lost or is not available corresponds to the base temporal layer.

48. The method of claim 35, where the scalable video coding format is based
on hybrid coding such as in H.264, VC-1 or AVS standards, wherein the coded
information from a spatial or quality layer lower than the target layer used
by the decoder when some or all of the target layer's coded information is
lost or is not available comprises at least one of:

motion vector data, appropriately scaled for the target layer's
resolution;

coded prediction error difference, upsampled to the target
layer's resolution; and

intra data, upsampled to the target layer's resolution,

the method further comprising, at the decoder using the target layer's decoded
pictures
as references in the decoding process in order to construct the decoded output
picture,
rather than the lower layer decoded reference pictures.

49. The method of claim 35 further comprising, at the decoder operating at
least one decoding loop for spatial or quality layers higher than the target
spatial or
quality layer for at least the base temporal layer, so that when the decoder
switches
target layers it can immediately display decoded pictures at the new target
layer
resolution.

50. A method for video communication over a communication network,
having a conferencing server disposed therein and linked to at least one
receiving and
at least one transmitting endpoint by at least one communication channel each
over
the communication network, the at least one endpoint transmitting coded
digital video
using a scalable video coding format, and the at least one receiving endpoint
capable
of decoding a digital video signal coded in a scalable video coding format
supporting
temporal scalability and at least one of spatial and quality scalability,
wherein the
scalable video coding format for spatial scalability includes a base spatial
and at least
one spatial enhancement layer, for quality scalability includes a base quality
layer and at
least one quality enhancement layer, and for temporal scalability includes a
base
temporal layer and at least one temporal enhancement layer, wherein the base
temporal layers and enhancement temporal layers are interlinked by a threaded
picture
prediction structure for at least one of the spatial or quality scalability
layers,

the method comprising:

at the conferencing server, selectively eliminating or modifying
portions of the input video signals received from transmitting endpoints that
correspond to layers higher than the base spatial or quality layer prior to
creating the
output video signal that is forwarded to the at least one receiving endpoint,
so that use
of lower spatial or quality layer data is signaled or explicitly coded in the
output video
signal for use in decoding pictures at resolutions higher than the base
spatial or quality
layer.

51. The method of claim 50, wherein the scalable video coding format is
based on hybrid coding such as in H.264, VC-1 or AVS standards, and wherein
the
lower spatial or quality layer data that is signaled for use or explicitly
coded in the
output video signal forwarded to the at least one receiving endpoint is
comprised of
at least one of:

motion vector data,

coded prediction error difference,

intra data, and

reference picture indicators,

wherein the data is further appropriately scaled to the desired
target resolution when explicitly coded in the output video signal that is
transmitted to
the one or more receiving endpoints.

52. The method of claim 50 wherein the server is further configured to
create the output video signal that is forwarded to the at least one receiving
endpoint
as one of:




a Transcoding Multipoint Control Unit using cascaded decoding and
encoding;

a Switching Multipoint Control Unit by selecting which input to
transmit as output;

a Scalable Video Communication Server using selective multiplexing;
and

a Compositing Scalable Video Communication Server using selective
multiplexing and bitstream-level compositing.

53. The method of claim 50, further comprising, at an encoder of the at
least one transmitting endpoint, encoding transmitted media as frames in a
threaded
coding structure having a number of different temporal levels, wherein a
subset of the
frames ("R") is particularly selected for reliable transport and includes at
least the
frames of the lowest temporal layer in the threaded coding structure and such
that the
decoder can decode at least a portion of received media based on a reliably
received
frame of the type R after packet loss or error and thereafter is synchronized
with the
encoder, and wherein the server selectively eliminates or modifies portions of
the
input video signals received from transmitting endpoints that correspond to
layers
higher than the base spatial or quality layer in non-R frames only, prior to
creating the
output video signal that is forwarded to the at least one receiving endpoint.

54. The method of claim 50, further comprising, at the conferencing server
controlling the transmission rate of the output video signal that is forwarded
to the at
least one receiving endpoint so that the retained portions of the input video
signals
received from transmitting endpoints that correspond to layers higher than the
base
spatial or quality layer do not adversely affect the smoothness of the output
bit rate.

55. The method of claim 50, further comprising, at the conferencing server
performing the selective elimination or modification according to desired
output bit
rate requirements.

56. The method of claim 50, further comprising, at the at least one
receiving endpoint displaying the decoded output picture at a desired spatial
resolution that falls in between an immediately lower and an immediately
higher
spatial layer provided by the received coded video signal.

57. The method of claim 56, further comprising, at the at least one
receiving endpoint, operating the decoding loop of the immediately higher
spatial
layer at the desired spatial resolution by scaling all coded data of the
immediately
higher spatial layer to the desired spatial resolution, and wherein the
resultant drift is
eliminated by using at least one of:

periodic intra pictures,

periodic use of intra base layer mode,

full resolution decoding of at least the lowest temporal layer of
the immediately higher spatial layer.

58. The method of claim 50, wherein the scalable video coding format is
further configured with at least one of:

periodic intra pictures;

periodic intra macroblocks; and

threaded picture prediction;

in order to avoid drift when the higher than the base spatial or
quality layer's coded information that is modified or eliminated corresponds
to the
base temporal layer.

59. The method of claim 50, further comprising, at the at least one
receiving endpoint operating at least one decoding loop for spatial or quality
layers
higher than the target spatial or quality layer for at least the base temporal
layer, so
that when the at least one receiving endpoint switches target layers it can
immediately
display decoded pictures at the new target layer resolution.

60. A video communication method comprising:
a communication network,

one endpoint that transmits coded digital video using a scalable
video coding format, and

at least one receiving endpoint that is capable of decoding a
digital video signal coded in a scalable video coding format supporting
temporal
scalability and at least one of spatial and quality scalability,

wherein the scalable video coding format for spatial scalability includes a
base spatial
and at least one spatial enhancement layer, for quality scalability includes a
base
quality layer and at least one quality enhancement layer, and for temporal
scalability
includes a base temporal layer and at least one temporal enhancement layer,
wherein
the base temporal layers and enhancement temporal layers are interlinked by a
threaded picture prediction structure for at least one of the spatial or
quality scalability
layers, and

wherein the transmitting endpoint is configured to selectively eliminate or
modify
portions of its coded video signal that correspond to layers higher than the
base spatial
or quality layer, prior to creating the output video signal that is forwarded
to the at
least one receiving endpoint, so that use of lower spatial or quality layer
data is
signaled or explicitly coded in the output video signal for use in decoding
pictures at
resolutions higher than the base spatial or quality layer.

61. The method of claim 60, wherein the scalable video coding format is
based on hybrid coding such as in H.264, VC-1 or AVS standards, and wherein
the
lower spatial or quality layer data that is signaled for use or explicitly
coded in the
output video signal forwarded to the at least one receiving endpoint is
comprised of
at least one of:

motion vector data;

coded prediction error difference;

intra data; and

reference picture indicators,

wherein the data is further appropriately scaled to the desired target
resolution when
explicitly coded in the output video signal that is transmitted to the one or
more
receiving endpoints.

62. The method of claim 60, further comprising, at the transmitting
endpoint encoding transmitted media as frames in a threaded coding structure
having
a number of different temporal levels, wherein a subset of the frames ("R") is

particularly selected for reliable transport and includes at least the frames
of the
lowest temporal layer in the threaded coding structure and such that the
decoder can
decode at least a portion of received media based on a reliably received frame
of the
type R after packet loss or error and thereafter is synchronized with the
encoder, and
wherein the transmitting endpoint selectively eliminates or modifies portions
of its
input video signal that correspond to layers higher than the base spatial or
quality
layer in non-R frames only, prior to creating the output video signal that is

transmitted to the at least one receiving endpoint.

63. The method of claim 60, further comprising, at the transmitting
endpoint controlling the transmission rate of the output video signal that is
transmitted
to the at least one receiving endpoint so that the retained portions of its
input video
signal that correspond to layers higher than the base spatial or quality layer
do not
adversely affect the smoothness of the output bit rate.

64. The method of claim 60, further comprising, at the transmitting
endpoint performing the selective elimination or modification according to
desired
output bit rate requirements.

65. The method of claim 60, further comprising, at the at least one
receiving endpoint displaying the decoded output picture at a desired spatial
resolution that falls in between an immediately lower and an immediately
higher
spatial layer provided by the received coded video signal.

66. The method of claim 65, further comprising, at the at least one
receiving endpoint operating the decoding loop of the immediately higher
spatial
layer at the desired spatial resolution by scaling all coded data of the
immediately
higher spatial layer to the desired spatial resolution, and wherein the
resultant drift is
eliminated by using at least one of:

periodic intra pictures,




periodic use of intra base layer mode,

full resolution decoding of at least the lowest temporal layer of the
immediately higher spatial layer.

67. The method of claim 60, wherein the scalable video coding format is
further configured with at least one of:

periodic intra pictures;

periodic intra macroblocks; and

threaded picture prediction,

in order to avoid drift when the higher than the base spatial or quality
layer's coded
information that is modified or eliminated corresponds to the base temporal
layer.

68. The method of claim 60, further comprising, at the receiving endpoint
operating at least one decoding loop for spatial or quality layers higher than
the target
spatial or quality layer for at least the base temporal layer, so that when
the at least
one receiving endpoint switches target layers it can immediately display
decoded
pictures at the new target layer resolution.

69. Computer readable media comprising a set of instructions to perform
the steps recited in at least one of the method claims 35-68.



Description

Note: Descriptions are shown in the official language in which they were submitted.



CA 02644753 2008-09-03
WO 2007/103889 PCT/US2007/063335
SYSTEM AND METHOD FOR PROVIDING ERROR RESILIENCE,
RANDOM ACCESS AND RATE CONTROL IN
SCALABLE VIDEO COMMUNICATIONS

CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of United States provisional patent
application Serial No. 60/778,760, filed March 3, 2006, of provisional patent
application Serial No. 60/787,031, filed March 29, 2006, and of provisional
patent application Serial No. 60/862,510, filed October 23, 2006. Further, this
application claims the benefit of related International patent application Nos.
PCT/US06/28365, PCT/US06/028366, PCT/US06/028367, PCT/US06/028368,
PCT/US06/061815, PCT/US06/62569, and PCT/US07/62357, and U.S. provisional
patent application Nos. 60/884,148, 60/786,997, and 60/829,609. All of the
aforementioned priority and related applications, which are commonly assigned,
are hereby incorporated by reference herein in their entireties.

FIELD OF THE INVENTION

[0002] The present invention relates to video data communication systems. The
invention specifically relates to simultaneously providing error resilience,
random

access, and rate control capabilities in video communication systems utilizing
scalable
video coding techniques.

BACKGROUND OF THE INVENTION

[0003] Transmission of digital video on packet-based networks such as those
based
on the Internet Protocol (IP) is extremely challenging, at least due to the
fact that data
transport is typically done on a best-effort basis. In modern packet-based
communication systems errors typically exhibit themselves as packet losses and
not
bit errors. Furthermore, such packet losses are typically the result of
congestion in
intermediary routers, and not the result of physical layer errors (one
exception to this
is wireless and cellular networks). When an error in transmission or receipt
of a video
signal occurs, it is important to ensure that the receiver can quickly recover
from the

error and return to an error-free display of the incoming video signal.
However, in
typical digital video communication systems, the receiver's robustness is
reduced by
the fact that the incoming data is heavily compressed in order to conserve
bandwidth.
Further, the video compression techniques employed in the communication
systems
(e.g., state-of-the-art codecs ITU-T H.264 and H.263 or ISO MPEG-2 and MPEG-4

codecs) can create a very strong temporal dependency between sequential video
packets or frames. In particular, use of motion compensated prediction (e.g.,
involving the use of P or B frames) in such codecs creates a chain of frame
dependencies in
which a displayed frame depends on past frame(s). The chain of dependencies
can
extend all the way to the beginning of the video sequence. As a result of the
chain of

dependencies, the loss of a given packet can affect the decoding of a number
of the
subsequent packets at the receiver. Error propagation due to the loss of the
given
packet terminates only at an "intra" (I) refresh point, or at a frame that
does not use
any temporal prediction at all.
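
By way of illustration only, the chain of dependencies described above may be sketched in Python. The sketch assumes a simplified model in which every P frame depends on the frame immediately preceding it and an I frame depends on no other frame; the function name and the frame pattern are hypothetical and are not taken from any codec specification.

```python
def damaged_frames(frame_types, lost_index):
    """Return the indices of frames affected by the loss of one frame.

    Simplified model (an assumption): each P frame is predicted from the
    frame immediately before it, and an I frame uses no temporal prediction.
    """
    damaged = []
    for i in range(lost_index, len(frame_types)):
        if i > lost_index and frame_types[i] == "I":
            break  # intra refresh point: error propagation terminates here
        damaged.append(i)
    return damaged


frames = list("IPPPPPIPPP")        # hypothetical sequence, I refresh at frame 6
print(damaged_frames(frames, 2))   # [2, 3, 4, 5]
```

In this model the loss of frame 2 corrupts every frame up to, but not including, the next I frame, which is why the spacing of intra refresh points bounds the recovery time.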

[0004] Error resilience in digital video communication systems requires having
at
least some level of redundancy in the transmitted signals. However, this
requirement
is contrary to the goals of video compression techniques, which strive to
eliminate or
minimize redundancy in the transmitted signals.

[0005] On a network that offers differentiated services (e.g., DiffServ IP-
based
networks, private networks over leased lines, etc.), a video data
communication

application may exploit network features to deliver some or all of video
signal data in
a lossless or nearly lossless manner to a receiver. However, in an arbitrary
best-effort
network (such as the Internet) that has no provision for differentiated
services, a data
communication application has to rely on its own features for achieving error
resilience. Known techniques (e.g., the Transmission Control Protocol - TCP)
that are

useful in generic data communications are not appropriate for video or audio
communications, which have the added constraint of low end-to-end delay
arising out
of human interface requirements. For example, TCP techniques may be used for
error
resilience in data transport using the File Transfer Protocol. TCP keeps
retransmitting data until receipt of all the data is confirmed, even if this
involves a delay of several seconds. However, TCP is inappropriate for video data
transport in a
live or interactive videoconferencing application because the end-to-end
delay, which
is unbounded, would be unacceptable to participants.

[0006] A related problem is that of random access. Assume that a receiver
joins
an existing transmission of a video signal. Typical instances are a user who
joins a videoconference, or a user who tunes in to a broadcast. Such a user
would

have to find a point in the incoming bitstream where he/she can start decoding
and be
in synchronization with the encoder. Providing such random access points,
however,
has a considerable impact on compression efficiency. Note that a random access
point is, by definition, an error resilience feature since at that point any
error

propagation terminates (i.e., it is an error recovery point). Hence, the
better the
random access support provided by a particular coding scheme, the faster error
recovery the coding scheme can provide. The converse may not always be true;
it
depends on the assumptions made about the duration and extent of the errors
that the
error resilience technique has been designed to address. For error resilience,
some
state information could be assumed to be available at the receiver at the time
the error
occurred.

[0007] As an example, in MPEG-2 video codecs for digital television systems
(digital cable TV or satellite TV), I pictures are used at periodic intervals
(typically
0.5 sec) to enable fast switching into a stream. The I pictures, however, are

considerably larger than their P or B counterparts (typically by 3-6 times)
and are thus
to be avoided, especially in low bandwidth and/or low delay applications.

[0008] In interactive applications such as videoconferencing, the concept of
requesting an intra update is often used for error resilience. In operation,
the update
involves a request from the receiver to the sender for an intra picture
transmission,

which enables the decoder to be synchronized. The bandwidth overhead of this
operation is significant. Additionally, this overhead is also incurred when
packet
errors occur. If the packet losses are caused by congestion, then the use of
the intra
pictures only exacerbates the congestion problem.

[0009] Another traditional technique for error resilience, which has been used
in
the past (e.g., in the H.261 standard) to mitigate drift caused by mismatch in
IDCT
implementations, is to periodically code each macroblock in intra mode. The
H.261
standard requires forced intra coding every 132 times a macroblock is
transmitted.

[0010] The coding efficiency decreases with increasing percentage of

macroblocks that are forced to be coded as intra in a given frame. Conversely,
when
this percentage is low, the time to recover from a packet loss increases. The
forced
intra coding process requires extra care to avoid motion-related drift, which
further
limits the encoder's performance since some motion vector values have to be
avoided,
even if they are the most effective.
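
By way of illustration only, the trade-off between the share of forced intra macroblocks and the recovery time may be sketched as follows. The sketch assumes a cyclic refresh in which a fixed number of macroblocks is intra coded in every frame, so that every macroblock position is refreshed once per cycle; this refresh pattern and the function names are assumptions made for the example. The macroblock counts used (99 for QCIF, 396 for CIF) follow from 16x16 macroblocks at 176x144 and 352x288 pixels.

```python
def refresh_frames(macroblocks_per_frame, intra_per_frame):
    """Frames needed for a cyclic refresh to intra code every macroblock once."""
    # Integer ceiling division: a partial last pass still costs a full frame.
    return -(-macroblocks_per_frame // intra_per_frame)


def refresh_seconds(macroblocks_per_frame, intra_per_frame, frame_rate):
    """Worst-case time for the refresh cycle to sweep the whole picture."""
    return refresh_frames(macroblocks_per_frame, intra_per_frame) / frame_rate


print(refresh_frames(99, 9))    # QCIF, 9 intra macroblocks per frame: 11 frames
print(refresh_frames(396, 9))   # CIF, same intra budget: 44 frames
```

Lowering the number of intra macroblocks per frame reduces the bit rate overhead but lengthens the cycle, and with it the time before a damaged region is guaranteed to be refreshed.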

[0011] In addition to traditional, single-layer codecs, layered or scalable
coding is
a well-known technique in multimedia data encoding. Scalable coding is used to
generate two or more "scaled" bitstreams collectively representing a given
medium in
a bandwidth-efficient manner. Scalability can be provided in a number of
different

dimensions, namely temporal, spatial, and quality (also referred to as SNR
"Signal-to-Noise Ratio" scalability or fidelity scalability). For example, a
video
signal may be scalably coded in different layers at CIF and QCIF resolutions,
and at
frame rates of 7.5, 15, and 30 frames per second (fps). Depending on the
codec's
structure, any combination of spatial resolutions and frame rates may be
obtainable

from the codec bitstream. The bits corresponding to the different layers can
be
transmitted as separate bitstreams (i.e., one stream per layer) or they can be
multiplexed together in one or more bitstreams. For convenience in description
herein, the coded bits corresponding to a given layer may be referred to as
that layer's
bitstream, even if the various layers are multiplexed and transmitted in a
single

bitstream. Codecs specifically designed to offer scalability features include,
for
example, MPEG-2 (ISO/IEC 13818-2, also known as ITU-T H.262) and the currently
developed SVC (known as ITU-T H.264 Annex G or MPEG-4 Part 10 SVC).
Scalable coding techniques specifically designed for video communication are
described in commonly assigned international patent application No.

PCT/US06/028365, "SYSTEM AND METHOD FOR SCALABLE AND LOW-
DELAY VIDEOCONFERENCING USING SCALABLE VIDEO CODING". It is
noted that even codecs that are not specifically designed to be scalable can
exhibit
scalability characteristics in the temporal dimension. For example, consider
an
MPEG-2 Main Profile codec, a non-scalable codec, which is used in DVDs and

digital TV environments. Further, assume that the codec is operated at 30 fps
and that
a group of pictures (GOP) structure of IBBPBBPBBPBBPBB (period N= 15 frames)
is used. By sequential elimination of the B pictures, followed by elimination
of the P
pictures, it is possible to derive a total of three temporal resolutions: 30
fps (all picture
types included), 10 fps (I and P only), and 2 fps (I only). The sequential
elimination

process results in a decodable bitstream because the MPEG-2 Main Profile codec
is
designed so that coding of the P pictures does not rely on the B pictures, and
similarly
coding of the I pictures does not rely on other P or B pictures. In the
following,
single-layer codecs with temporal scalability features are considered to be a
special
case of scalable video coding, and are thus included in the term scalable
video coding,
unless explicitly indicated otherwise.
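
By way of illustration only, the three temporal resolutions of the example above can be checked with a short Python sketch. The function name is hypothetical; the GOP pattern and the 30 fps rate are those given in the text.

```python
from fractions import Fraction


def temporal_resolutions(gop, full_rate):
    """Frame rates left after dropping B pictures, then P pictures as well."""
    keep_sets = [("I", "P", "B"), ("I", "P"), ("I",)]
    rates = []
    for keep in keep_sets:
        kept = sum(1 for picture_type in gop if picture_type in keep)
        # Exact arithmetic: kept pictures per GOP, scaled to the full frame rate.
        rates.append(Fraction(full_rate * kept, len(gop)))
    return rates


gop = "IBBPBBPBBPBBPBB"   # period N = 15, from the example above
print([float(r) for r in temporal_resolutions(gop, 30)])   # [30.0, 10.0, 2.0]
```

The GOP contains one I picture, four P pictures and ten B pictures, so keeping I and P pictures retains 5 of every 15 frames (10 fps) and keeping I pictures alone retains 1 of every 15 (2 fps).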

[0012] Scalable codecs typically have a pyramidal bitstream structure in which
one of the constituent bitstreams (called the "base layer") is essential in
recovering the
original medium at some basic quality. Use of one or more of the remaining
bitstream(s)
(hereinafter called "the enhancement layer(s)") along with the base layer
increases the

quality of the recovered medium. Data losses in the enhancement layers may be
tolerable, but data losses in the base layer can cause significant distortions
or
complete loss of the recovered medium.

[0013] Scalable codecs pose challenges similar to those posed by single layer
codecs for error resilience and random access. However, the coding structures
of the
scalable codecs have unique characteristics that are not present in single
layer video

codecs. Further, unlike single layer coding, scalable coding may involve
switching
from one scalability layer to another (e.g., switching back and forth between
CIF and
QCIF resolutions). Instantaneous switching between different resolutions
with very little bit rate overhead is desirable for random access in scalable
coding systems in which multiple signal resolutions (spatial/temporal/quality)
may be
available from the encoder.

[0014] A problem related to those of error resilience and random access is
that of
rate control. The output of a typical video encoder has a variable bit rate,
due to the

extensive use of prediction, transform and entropy coding techniques. In order
to
construct a constant bit rate stream, buffer-constrained rate control is
typically
employed in a video communication system. In such a system, an output buffer
at the
encoder is assumed, which is emptied at a constant rate (the channel rate);
the encoder
monitors the buffer's occupancy and makes parameter selections (e.g.,
quantizer step

size) in order to avoid buffer overflow or underflow. Such a rate control
mechanism,
however, can only be applied at the encoder, and further assumes that the
desired
output rate is known. In some video communication applications, including
videoconferencing, it is desirable that such rate control decisions are made
at an
intermediate gateway (e.g., at a Multipoint Control Unit - MCU), which is
situated

between the sender and the receiver. Bitstream-level manipulation, or
transcoding,
can be used at the gateway, but at considerable processing and complexity
cost. It is
therefore desirable to employ a technique that achieves rate control without
requiring
any additional processing at the intermediate gateway.
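
By way of illustration only, the buffer-constrained rate control loop described above may be sketched as a toy simulation. The bit-cost model (frame bits equal to a complexity value divided by the quantizer step), the thresholds and all names are assumptions made for this example and do not represent the rate control of any particular encoder.

```python
def simulate_rate_control(complexities, channel_bits_per_frame, buffer_size,
                          q_min=1.0, q_max=64.0):
    """Toy encoder-side rate control with a constant-rate output buffer.

    Assumptions (not from the source): a frame costs complexity / q bits,
    and the quantizer step q is nudged according to buffer fullness.
    """
    occupancy = buffer_size / 2.0   # start with the buffer half full
    q = 8.0
    history = []
    for complexity in complexities:
        fullness = occupancy / buffer_size
        if fullness > 0.6:
            q = min(q_max, q * 1.25)   # coarser quantization, fewer bits
        elif fullness < 0.4:
            q = max(q_min, q / 1.25)   # finer quantization, more bits
        bits = complexity / q
        # Coded bits enter the buffer; the channel drains it at a constant rate.
        occupancy += bits - channel_bits_per_frame
        # Clamping stands in for overflow and underflow handling in this model.
        occupancy = max(0.0, min(buffer_size, occupancy))
        history.append((q, bits, occupancy))
    return history


# A scene that becomes harder to code after 20 frames (hypothetical numbers).
trace = simulate_rate_control([80000.0] * 20 + [160000.0] * 20, 10000.0, 40000.0)
print(trace[-1])
```

The loop needs both the buffer state and the target channel rate, which is why such a mechanism sits naturally at the encoder and not at an intermediate gateway.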

[0015] Consideration is now being given to improving error resilience and
capabilities for random access to the coded bitstreams, and rate control in
video
communications systems. Attention is directed to developing error resilience,
rate
control, and random access techniques, which have a minimal impact on end-to-
end
delay and the bandwidth used by the system.


SUMMARY OF THE INVENTION

[0016] The present invention provides systems and methods to increase error
resilience and provide random access and rate control capabilities in video
communication systems that use scalable video coding. The systems and methods

also allow the derivation of an output signal at a resolution different than
the coded
resolutions, with excellent rate-distortion performance.

[0017] In one embodiment, the present invention provides a mechanism to
recover
from loss of packets of a high resolution spatially scalable layer by using
information
from the low resolution spatial layer. In another embodiment, the present
invention

provides a mechanism to switch from a low spatial or SNR resolution to a high
spatial
or SNR resolution with little or no delay. In yet another embodiment, the
present
invention provides a mechanism for performing rate control, in which the
encoder or
an intermediate gateway (e.g., an MCU) selectively eliminates packets from the
high
resolution spatial layer, anticipating the use of appropriate error recovery
mechanisms

at the receiver that minimize the impact of the lost packets on the quality of
the
received signal. In yet another embodiment, the encoder or an intermediate
gateway
selectively replaces packets from the high resolution spatial layer with
information
that effectively instructs the decoder to reconstruct an approximation to the
high
resolution data being replaced using information from the base layer and past
frames

of the enhancement layer. In another embodiment, the present invention
describes a
mechanism for deriving an output video signal at a resolution different than
the coded
resolutions, and specifically an intermediate resolution between those used
for
spatially scalable coding. These embodiments, either alone or in combination,
allow
the construction of video communication systems with significant rate control
and

resolution flexibility as well as error resilience and random access.
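
By way of illustration only, the rate control embodiment in which an encoder or intermediate gateway selectively eliminates high resolution layer packets may be sketched as follows. The packet structure, the layer labels and the greedy selection rule are hypothetical choices made for this example; the sketch always forwards base layer packets and relies on the receiver to conceal the enhancement layer packets that are dropped.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    frame: int
    layer: str   # "base" or "enh" (hypothetical labels)
    size: int    # bytes


def thin_stream(packets, budget_bytes):
    """Keep every base layer packet, then enhancement packets while they fit."""
    kept = [p for p in packets if p.layer == "base"]
    used = sum(p.size for p in kept)
    for p in packets:
        if p.layer == "enh" and used + p.size <= budget_bytes:
            kept.append(p)
            used += p.size
    # Restore transmission order: by frame, base layer before enhancement.
    kept.sort(key=lambda p: (p.frame, p.layer != "base"))
    return kept


stream = [Packet(0, "base", 100), Packet(0, "enh", 300),
          Packet(1, "base", 100), Packet(1, "enh", 300)]
for p in thin_stream(stream, 500):
    print(p.frame, p.layer, p.size)
```

No decoding or re-encoding takes place in this sketch; the gateway only decides which packets to forward, which keeps its processing cost low.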


[0018] The inventive systems and methods are based on "error concealment"
techniques in conjunction with scalable coding techniques. The techniques
simultaneously achieve error resilience and rate control for a particular
family of
video encoders referred to as scalable video encoders. The rate-distortion

performance of the error concealment techniques is such that it matches or
exceeds
that of coding at the effective transfer rate (total transmitted minus the
rate of the lost
packets). By appropriate selection of picture coding structures and transport
modes
the techniques allow nearly instantaneous layer switching with very little bit
rate
overhead.

[0019] Further, the techniques can be used to derive a decoded version of the
received
signal at a resolution different than the coded resolution(s). This allows,
for example,
the creation of a 1/2 CIF (HCIF) signal out of a spatially scalable coded
signal at QCIF
and CIF resolutions. In contrast, with typical scalable coding the receiver
would

either have to use the QCIF signal and upsample it (with poor quality), or use
the CIF
signal and downsample it (with good quality but high bit rate utilization).
The same
problem also exists if the QCIF and CIF are simulcast as single-layer streams.

[0020] The techniques also provide rate control with minimal processing of the
encoded video bitstream without adversely affecting picture quality.

BRIEF DESCRIPTION OF THE DRAWINGS

[0021] Further features, the nature, and various advantages of the invention
will be
more apparent from the following detailed description of the preferred
embodiments
and the accompanying drawings in which:

[0022] FIG. 1 is a block diagram illustrating the overall architecture of a
videoconferencing system in accordance with the principles of the present
invention;

[0023] FIG. 2 is a block diagram illustrating an exemplary end-user terminal
in
accordance with the principles of the present invention;

[0024] FIG. 3 is a block diagram illustrating an exemplary architecture of a
video
encoder (base and temporal enhancement layers) in accordance with the
principles of
the present invention;

[0025] FIG. 4 is a diagram illustrating an exemplary picture coding structure
in
accordance with the principles of the present invention;

FIG. 5 is a diagram
illustrating an example of an alternative picture coding structure in
accordance with
the principles of the present invention;

[0026] FIG. 6 is a block diagram illustrating an exemplary architecture of a
video
encoder for a spatial enhancement layer in accordance with the principles of
the
present invention;

[0027] FIG. 7 is a diagram illustrating an exemplary picture coding structure
when
spatial scalability is used in accordance with the principles of the present
invention;

[0028] FIG. 8 is a diagram illustrating an exemplary decoding process with
concealment of enhancement layer pictures in accordance with the principles of
the
present invention;



[0029] FIG. 9 is a diagram illustrating exemplary R-D curves of the
concealment
process when applied to the `Foreman' sequence in accordance with the
principles of
the present invention;

[0030] FIG. 10 is a diagram illustrating an exemplary picture coding structure
when
spatial scalability with SR pictures is used in accordance with the principles
of the
present invention.

[0031] Throughout the Figures the same reference numerals and characters,
unless
otherwise stated, are used to denote like features, elements, components or
portions of
the illustrated embodiments. Moreover, while the present invention will now be

described in detail with reference to the Figures, it is done so in connection
with the
illustrative embodiments.

DETAILED DESCRIPTION OF THE INVENTION

[0032] Systems and methods are provided for error resilient transmission,
random
access and rate control in video communication systems. The systems and
methods
exploit error concealment techniques based on features of scalable video
coding,
which may be used in the video communication systems.

[0033] In a preferred embodiment, an exemplary video communication system
may be a multi-point videoconferencing system 10 operated over a packet-based
network. (See e.g., FIG. 1). Multi-point videoconferencing system 10 may include

optional bridges 120a and 120b (e.g., Multipoint Control Unit (MCU) or
Scalable
Video Communication Server (SVCS)) to mediate scalable multilayer or single
layer
video communications between endpoints (e.g., users 1-k and 1-m) over the
network.
The operation of the exemplary video communication system is the same and as

advantageous for a point-to-point connection with or without the use of
optional
bridges 120a and 120b. The techniques described in this invention can be
applied
directly to all other video communication applications, including point-to-
point
streaming, broadcasting, multicasting, etc.

[0034] A detailed description of scalable video coding techniques and

videoconferencing systems based on scalable video coding is provided, for
example,
in commonly assigned International patent application Nos. PCT/US06/28365 and
PCT/US06/28366. Further, descriptions of scalable video coding techniques and
videoconferencing systems based on scalable video coding are provided in
commonly
assigned International patent application Nos. PCT/US06/62569 and

PCT/US06/061815.

[0035] FIG. 1 shows the general structure of a videoconferencing system 10.
Videoconferencing system 10 includes a plurality of end-user terminals (e.g.,
users 1-
k and users 1-m) that are linked over a network 100 via LANs 1 and 2 and
servers
120a and 120b. The servers may be traditional MCUs, or Scalable Video Coding

servers (SVCS) or Compositing Scalable Video Coding servers (CSVCS). The
latter
servers have the same purpose as traditional MCUs, but with significantly
reduced
complexity and improved functionality. (See e.g., International patent
application
Nos. PCT/US06/28366 and PCT/US06/62569). In the description herein, the term
"server" may be used generically to refer to either an SVCS or a CSVCS.

[0036] FIG. 2 shows the architecture of an end-user terminal 140, which is
designed for use with videoconferencing systems (e.g., system 10) based on
multi-layer coding. Terminal 140 includes human interface input/output devices
(e.g., a
camera 210A, a microphone 210B, a video display 250C, a speaker 250D), and one
or
more network interface controller cards (NICs) 230 coupled to input and output
signal

multiplexer and demultiplexer units (e.g., packet MUX 220A and packet DMUX
220B). NIC 230 may be a standard hardware component, such as an Ethernet LAN
adapter, or any other suitable network interface device, or a combination
thereof.
[0037] Camera 210A and microphone 210B are designed to capture participant
video and audio signals, respectively, for transmission to other conferencing

participants. Conversely, video display 250C and speaker 250D are designed to
display and play back video and audio signals received from other
participants,
respectively. Video display 250C may also be configured to optionally display
participant/terminal 140's own video. Camera 210A and microphone 210B outputs
are coupled to video and audio encoders 210G and 210H via analog-to-digital

converters 210E and 210F, respectively. Video and audio encoders 210G and 210H
are designed to compress input video and audio digital signals in order to
reduce the
bandwidths necessary for transmission of the signals over the electronic
communications network. The input video signal may be live, or pre-recorded
and
stored video signals. The encoders compress the local digital signals in order
to

minimize the bandwidth necessary for transmission of the signals.

[0038] In an exemplary embodiment of the present invention, the audio signal
may be encoded using any suitable technique known in the art (e.g., G.711,
G.729,
G.729EV, MPEG-1, etc.). In a preferred embodiment of the present invention,
the
scalable audio codec G.729EV is employed by audio encoder 210H to encode audio
signals. The output of audio encoder 210H is sent to multiplexer MUX 220A
for
transmission over network 100 via NIC 230.

[0039] Packet MUX 220A may perform traditional multiplexing using the RTP
protocol. Packet MUX 220A may also perform any related Quality of Service
(QoS)
processing that may be offered by network 100 or directly by a video
communication

application (see e.g. International patent application No. PCT/US06/061815).
Each
stream of data from terminal 140 is transmitted in its own virtual channel or
"port
number" in IP terminology.

[0040] Video encoder 210G is a scalable video encoder that has multiple
outputs,
corresponding to the various layers (here labeled "base" and "enhancement").
It is
noted that simulcasting is a special case of scalable coding, where no
inter-layer prediction takes place. In the following, when the term scalable coding is
used, it
includes the simulcasting case. The operation of the video encoder and the
nature of
the multiple outputs are described in more detail herein below.

[0041] In the H.264 standard specification, it is possible to combine views of

multiple participants in a single coded picture by using a flexible macroblock
ordering
(FMO) scheme. In this scheme, each participant occupies a portion of the coded
image corresponding to one of its slices. Conceptually, a single decoder can
be used
to decode all participant signals. However, from a practical view, the
receiver/terminal will have to decode several smaller independently coded
slices.

Thus, terminal 140 shown in FIG. 2 with decoders 230A may be used in
applications
of the H.264 specification. It is noted that the server for forwarding slices
is a
CSVCS.

[0042] In terminal 140, demultiplexer DMUX 220B receives packets from NIC
230 and redirects them to the appropriate decoder unit 230A.

[0043] The SERVER CONTROL block in terminal 140 coordinates the
interaction between the server (SVCS/CSVCS) and the end-user terminals as
described in International patent applications Nos. PCT/US06/028366 and
PCT/US06/62569. In a point-to-point communication system without intermediate
servers, the SERVER CONTROL block is not needed. Similarly, in
non-conferencing applications, point-to-point conferencing applications, or when a
CSVCS is used, only a single decoder may be needed at a receiving end-user
terminal.
For applications involving stored video (e.g., broadcast of pre-recorded,
pre-coded material), the transmitting end-user terminal may not involve the entire
functionality of
the audio and video encoding blocks and all blocks preceding them (camera,

microphone, etc.). Specifically, only the portions related to selective
transmission of
video packets, as explained below, need to be provided.

[0044] Although the word "terminal" is used in this context, the various
components of the terminal may be separate devices that are interconnected to
each
other, they may be integrated in a personal computer in software or hardware,
or they
could be combinations thereof.

[0045] FIG. 3 shows an exemplary base layer video encoder 300. Encoder 300
includes a FRAME BUFFERS block 310 and an Encoder Reference Control (ENC
REF CONTROL) block 320 in addition to conventional "text-book" variety video
coding process blocks 330 for motion estimation (ME), motion compensation
(MC),

and other encoding functions. Video encoder 300 may be designed, for example,
according to the H.264/MPEG-4 AVC (ITU-T and ISO/IEC JTC 1, "Advanced video
coding for generic audiovisual services," ITU-T Recommendation H.264 and
ISO/IEC 14496-10 (MPEG4-AVC)) or SVC (J. Reichel, H. Schwarz, and M. Wien,
"Joint Scalable Video Model JSVM 4," JVT-Q202, Document of Joint Video Team

(JVT) of ITU-T SG16/Q.6 and ISO/IEC JTC 1/SC 29/WG 11, October 2005). It will
be understood that any other suitable codecs or designs can be used for the
video
encoder, including, for example, the designs disclosed in International patent
applications Nos. PCT/US06/28365 and PCT/US06/62569. If spatial scalability is
used, then a DOWNSAMPLER is optionally used at the input to reduce the input

resolution (e.g., from CIF to QCIF).



[0046] ENC REF CONTROL block 320 is used to create a "threaded" coding
structure. (See e.g., International patent application No. PCT/US06/28365).
Standard
block-based motion-compensated codecs have a regular structure of I, P, and B
frames. For example, in a picture sequence (in display order) such as IBBPBBP,
the

`P' frames are predicted from the previous P or I frame in the sequence,
whereas the
B pictures are predicted using both the previous and next P or I frame.
Although the
number of B pictures between successive I or P pictures can vary, as can the
rate at
which I pictures appear, it is not possible, for example, for a P picture to
use as a
reference for prediction another P picture that is earlier in time than the
most recent

one. The H.264 coding standard advantageously provides an exception in that
two
reference picture lists are maintained by the encoder and decoder,
respectively, with
appropriate signaling information that provides for reordering and selective
use of
pictures from within those lists. This exception can be exploited to select
which
pictures are used as references and also which references are used for a
particular

picture that is to be coded. In FIG. 3, FRAME BUFFERS block 310 represents
memory for storing the reference picture list(s). ENC REF CONTROL block 320 is
designed to determine which reference picture is to be used for the current
picture at
the encoder side.

[0047] The operation of ENC REF CONTROL block 320 is placed in further context
with reference to an exemplary layered picture coding "threading" or
"prediction
chain" structure 400 shown in FIG. 4, in which the letter `L' is used to
indicate an
arbitrary scalability layer, followed by a number to indicate the temporal
layer (0
being the lowest, or coarsest). The arrows indicate the direction, source, and
target of
prediction. L0 is simply a series of regular P pictures spaced four pictures
apart. L1 has the same frame rate, but prediction is only allowed from the
previous L0 frame.
L2 frames are predicted from the most recent L0 or L1 frame. L0 provides one
fourth (1:4) of the full temporal resolution, L1 doubles the L0 frame rate
(1:2), and L2 doubles the L0+L1 frame rate (1:1).
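
By way of illustration only, the three-thread temporal structure described above
may be sketched in Python as follows. The picture numbering (L0 at every fourth
picture, L1 midway between L0 pictures, L2 at the remaining pictures) and the
function names are assumptions made for this sketch; FIG. 4 is not reproduced
here, and picture 0 is assumed to have no earlier reference.

```python
def temporal_layer(n: int) -> int:
    """Temporal layer (0, 1 or 2) of picture n in a three-thread structure."""
    if n % 4 == 0:
        return 0          # L0: every fourth picture
    if n % 4 == 2:
        return 1          # L1: midway between two L0 pictures
    return 2              # L2: all remaining (odd-numbered) pictures


def reference_picture(n: int):
    """Index of the picture that picture n is predicted from (None for picture 0)."""
    if n == 0:
        return None       # assumed: the first picture has no earlier reference
    layer = temporal_layer(n)
    if layer == 0:
        return n - 4      # previous L0 picture
    if layer == 1:
        return n - 2      # previous L0 picture
    return n - 1          # most recent L0 or L1 picture


def decodable_without(dropped_layers, count=16):
    """True if every kept picture references only kept pictures."""
    kept = {n for n in range(count) if temporal_layer(n) not in dropped_layers}
    return all(reference_picture(n) is None or reference_picture(n) in kept
               for n in kept)
```

Under these assumptions, decodable_without({2}) and decodable_without({1, 2})
both hold, which corresponds to removing top-level threads, whereas dropping L1
alone leaves some L2 pictures without their reference.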

[0048] Additional or fewer layers can be similarly constructed to accommodate
different bit rate/scalability requirements, depending on the requirements of
the
specific implementation of the present invention. A simple example is shown in
FIG.

5 where a traditional prediction series of IPPP... frames is converted to two
layers.
[0049] Codecs 300 utilized in implementations of the present invention may be
configured to generate a set of separate picture "threads" (e.g., a set of
three threads

410-430) in order to enable multiple levels of temporal scalability
resolutions (e.g.,
L0-L2) and other enhancement resolutions (e.g., S0-S2). A thread or prediction
chain
is defined as a sequence of pictures that are motion-compensated using
pictures either
from the same thread, or pictures from a lower level thread. The arrows in
FIG. 4
indicate the direction, source, and target of prediction for three threads 410-
430.

Threads 410-430 have a common source L0 but different targets and paths (e.g.,
targets L2, L2, and L0, respectively). The use of threads allows the
implementation
of temporal scalability, since any number of top-level threads can be
eliminated
without affecting the decoding process of the remaining threads.

[0050] It is noted that in encoder 300, ENC REF CONTROL block may use only P
pictures as reference pictures. The use of B pictures with both forward and
backward
prediction increases the coding delay by the time it takes to capture and
encode the
reference pictures used for the B pictures. In traditional interactive
communications,
the use of B pictures with prediction from future pictures increases the
coding delay
and is therefore avoided. However, B pictures also may be used with
accompanying

gains in overall compression efficiency. Using even a single B picture in the
set of
threads (e.g., by having L2 be coded as a B picture) can improve compression
efficiency. For applications that are not delay-sensitive, some or all
pictures (with the
possible exception of L0) can be B pictures with bi-directional prediction. It
is noted
that specifically with the H.264 standard, it is possible to use B pictures
without

incurring extra delay, as the standard allows the use of two motion vectors
that both
use reference pictures that are in the past in display order. In this case,
such B
pictures can be used without increasing the coding delay compared with P
picture
coding. Similarly, the L0 pictures could be I pictures, forming traditional
groups of
pictures (GOPs).

[0051] With renewed reference to FIG. 3, base layer encoder 300 can be
augmented
to create spatial and/or quality enhancement layers, as described, for example
in the
H.264 SVC Standard draft and in International patent application No.

PCT/US06/28365. FIG. 6 shows the structure of an exemplary encoder 600 for
creating the spatial enhancement layer. The structure of encoder 600 is
similar to that
of base layer codec 300, with the additional feature that the base layer
information is

also made available to encoder 600. This information may include motion vector
data, macroblock mode data, coded prediction error data, and reconstructed
pixel data.
Encoder 600 can re-use some or all of this information in order to make coding
decisions for the enhancement layer. For this purpose, the base layer data has
to be

scaled to the target resolution of the enhancement layer (e.g., by a factor of 2
if the base
layer is QCIF and the enhancement layer is CIF). Although spatial scalability
usually
requires two coding loops to be maintained, it is possible (e.g., under the
H.264 SVC
draft standard) to perform single-loop decoding by limiting the base layer
data that is
used for enhancement layer coding to only values that are computable from the

information encoded in the current picture's base layer. For example, if a
base layer
macroblock is inter-coded, then the enhancement layer cannot use the
reconstructed
pixels of that macroblock as a basis for prediction. It can, however, use its
motion
vectors and the prediction error values since they are obtainable by just
decoding the
information contained in the current base layer picture. Single-loop decoding
is

desirable since the complexity of the decoder is significantly decreased.

[0052] The threading structure can be utilized for the enhancement layer
frames in the
same manner as for the base layer frames. FIG. 7 shows an exemplary threading
structure 700 for the enhancement layer frames following the design shown in
FIG. 4.
In FIG. 7, the enhancement layer blocks in structure 700 are indicated by the
letter

`S'. It is noted that threading structures for the enhancement layer frames
and the
base layer can be different, as explained in International patent application
No.
PCT/US06/28365.

[0053] Further, similar enhancement layer codecs for quality scalability can
be
constructed, for example, as described in the SVC draft standard and described
in
International patent application No. PCT/US06/28365. In such codecs for
quality

scalability, instead of building the enhancement layer on a higher resolution
version
of the input, the enhancement layer is built by coding the residual prediction
error at
the same spatial resolution as the input. As with spatial scalability, all the
macroblock
data of the base layer can be re-used at the enhancement layer for quality
scalability,

in either single- or dual-loop coding configurations.

[0054] For brevity, the following description is limited to spatial
scalability, but it
will be understood that the described techniques also can be applied to
quality or
fidelity scalability.

[0055] It is noted that due to the inherent temporal dependency arising from
motion-
compensated prediction in state-of-the-art video codecs, any packet losses at
a given
picture will not only affect the quality of that particular picture, but will
also affect all
future pictures for which the given picture acts as a reference, either
directly or
indirectly. This is because the reference frame that the decoder can construct
for
future predictions will not be the same as the one used at the encoder. The
ensuing

difference, or drift, can have tremendous impact on the visual quality of the
decoded
video signals. However, as described in International patent application Nos.
PCT/US06/28365 and PCT/US06/061815, structure 400 (FIG. 4) has distinct
advantages in terms of robustness in the presence of transmission errors.

[0056] As shown in FIG. 4, threading structure 400 creates three self-contained
chains of dependencies. A packet loss occurring at an L2 picture will only
affect L2 pictures; L0 and L1 pictures can still be decoded and displayed.
Similarly, a packet loss occurring at an L1 picture will only affect L1 and L2
pictures; L0 pictures can still be decoded and displayed.
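
The error containment property may be illustrated with the following Python
sketch, which follows prediction references forward from a lost picture. The
picture numbering (L0 at multiples of four, L1 at the pictures midway between
them, L2 elsewhere) and the function names are assumptions made for
illustration and are not taken from FIG. 4 itself.

```python
def reference_picture(n: int):
    """Reference of picture n in the assumed three-thread structure."""
    if n == 0:
        return None       # assumed: the first picture has no earlier reference
    if n % 4 == 0:
        return n - 4      # L0 is predicted from the previous L0
    if n % 4 == 2:
        return n - 2      # L1 is predicted from the previous L0
    return n - 1          # L2 is predicted from the most recent L0 or L1


def affected_by_loss(lost: int, count: int):
    """Pictures below `count` that are damaged when picture `lost` is lost."""
    damaged = {lost}
    for n in range(lost + 1, count):
        if reference_picture(n) in damaged:
            damaged.add(n)    # drift propagates through the prediction chain
    return sorted(damaged)
```

In this sketch a lost L2 picture (e.g., picture 1) damages only itself, a lost
L1 picture (e.g., picture 2) damages itself and the following L2 picture, and
only the loss of an L0 picture propagates to all later pictures.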

[0057] The same error containment properties of the threads extend to S
packets. For
example, with structure 700 (FIG. 7) a loss occurring at an S2 picture only
affects the particular picture, whereas a loss at an S1 picture will also affect
the following S2 picture. In either case, drift will terminate upon decoding of
the next S0 picture.
[0058] With the use of threaded structures, if the base layer and some
enhancement
layer pictures are transmitted in such a way that their delivery is
guaranteed, the

remaining layers can be transmitted on a best-effort basis without
catastrophic results
in the case of a packet loss. The required guaranteed transmissions can be
performed
using DiffServ, FEC techniques, or other suitable techniques known in the art.
For
the description herein it is assumed that the guaranteed and best effort
transmissions
occur over the two actual or virtual channels (e.g. a High Reliability Channel
(HRC)

and Low Reliability Channel (LRC), respectively) that offer such
differentiated


quality of service. (See e.g., International patent application Nos.
PCT/US06/028366
and PCT/US06/061815).

[0059] Consider, for example, that layers L0-L2 and S0 are transmitted on the
HRC,
and that S1 and S2 are transmitted on the LRC. Although the loss of an S1 or S2
packet would cause limited drift, it would still be desirable to be able to
conceal as much as possible the loss of information. The concealment of a lost
S1 or S2
picture
can only use information available to the decoder, namely past S pictures, and
also the
coded information of the current picture's base layer.

[0060] An exemplary concealment technique according to the present invention

utilizes the base layer information of the lost enhancement layer frame, and
applies it
in the decoding loop of the enhancement layer. The base layer information that
can
be used includes motion vector data (appropriately scaled for the target layer
resolution), coded prediction error difference (upsampled for the enhancement
layer
resolution, if necessary), and intra data (upsampled for the enhancement layer

resolution, if necessary). Prediction references from prior pictures are
taken, when
needed, from the enhancement layer resolution pictures rather than the
corresponding
base layer pictures. This data allows the decoder to reconstruct a very close
approximation of the missing frame, thus minimizing the actual and perceived
distortions on the missing frame. Furthermore, decoding of any dependent
frames is

now also possible since a good approximation of the missing frame is
available.
[0061] FIG. 8 shows exemplary steps 810-840 of a concealment decoding process
800, using an example of a two-layer spatial scalability encoded signal with
resolutions QCIF and CIF and two prediction threads (L0/S0 and L1/S1). It will
be
understood that process 800 is applicable to other resolutions and to
different numbers

of threads than shown. In the example, it is assumed that at coded data
arrival step
810 the coded data for L0, S0, and L1 arrive intact at the receiving terminal,
but the coded data for S1 are lost. Further, it is assumed that all coded data
for pictures prior to the picture corresponding to time t0 also have been
received at the receiving terminal. The decoder is thus able to properly decode
both a QCIF and a CIF picture at time t0. The decoder can further use the
information contained in L0 and L1 to reconstruct the correct decoded L1 picture
corresponding to time t1.

[0062] FIG. 8 shows a particular example, in which a block LB1 of the L1 picture
at time t1, which is encoded using motion-compensated prediction with a motion
vector LMV1 and a residual LRES1 that is to be added to the motion-compensated
prediction, is decoded at base layer decoding step 820. The data for LMV1 and
LRES1 are contained in the L1 data received by the receiving terminal. The
decoding process requires block LB0 from the prior base layer picture (the L0
picture), which is available at the decoder as a result of the normal decoding
process. Since the S1 data are assumed to be lost in this example, the decoder
cannot use the corresponding information to decode the enhancement layer
picture.

[0063] Concealment decoding process 800 constructs an approximation for an
enhancement layer block SB1. At concealment data generation step 830, process
800 generates concealment data by obtaining the coded data of the corresponding
base layer block LB1, in this example LMV1 and LRES1. It then scales the motion
vector to the resolution of the enhancement layer, to construct an enhancement
layer motion vector SMV1. For the two-layer video signal example considered,
SMV1 is equal to two times LMV1 since the ratio of resolutions of the scalable
signal is 2. Further, the concealment decoding process 800 upsamples the base
layer residual signal to the resolution of the enhancement layer, by a factor of
2 in each dimension, and then

optionally low-pass filters the result with the filter LPF, in accordance with
well-
known principles of sample rate conversion processes. The further result of
concealment data generation step 830 is a residual signal SRES1. Next step 840
(Decoding process for the enhancement layer with concealment) uses the
constructed concealment data SMV1 and SRES1 to approximate block SB1. It is
noted that the approximation requires the block SB0 from the previous
enhancement layer picture, which is assumed to be available at the decoder as a
result of the regular decoding process of the enhancement layer. Different
encoding modes may operate in the same or similar way.
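
A minimal Python sketch of steps 830 and 840 is given below for illustration.
It is not the disclosed implementation: it assumes integer-pel motion vectors,
8-bit samples, a reference area lying wholly inside the previous enhancement
layer picture, and nearest-neighbour upsampling with the optional low-pass
filter LPF omitted. The function and parameter names are hypothetical.

```python
def upsample2x(block):
    """Nearest-neighbour 2x upsampling: each sample becomes a 2x2 patch."""
    out = []
    for row in block:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def conceal_block(prev_enh, x, y, lmv, lres):
    """Approximate a lost enhancement layer block SB1 from base layer data.

    prev_enh : previous decoded enhancement layer picture (2-D list of samples)
    x, y     : top-left corner of SB1 in the enhancement layer picture
    lmv      : (dx, dy) base layer motion vector LMV1
    lres     : base layer residual block LRES1 (2-D list)
    """
    smv = (2 * lmv[0], 2 * lmv[1])     # SMV1 = 2 * LMV1 (resolution ratio 2)
    sres = upsample2x(lres)            # SRES1 (low-pass filtering omitted)
    sb1 = []
    for j, res_row in enumerate(sres):
        row = []
        for i, res in enumerate(res_row):
            pred = prev_enh[y + smv[1] + j][x + smv[0] + i]   # sample of SB0
            row.append(max(0, min(255, pred + res)))          # clip to 8 bits
        sb1.append(row)
    return sb1
```

The prediction is taken from the previous enhancement layer picture rather than
from the base layer picture, so only the motion vector and the residual are
borrowed from the base layer.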

[0064] A further illustrative application of the inventive concealment
technique
relates to the example of high resolution images. In high resolution images
(e.g., greater than CIF) often more than one MTU (maximum transmission unit) is
required to transmit a frame of the enhancement layer. If the chance of
successful transmission of a single MTU-sized packet is p, the chance of
successful transmission of a frame comprised of n MTUs is p^n. Traditionally, in
order to display such a frame, all n packets have to be successfully delivered.
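
The relationship between per-packet and per-frame delivery can be checked with
a short Python sketch. It assumes that packet losses are independent, which is
what the n-th power expression implies; the function name is hypothetical.

```python
def frame_delivery_probability(p: float, n: int) -> float:
    """Probability that all n packets of a frame arrive, each with probability p."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return p ** n
```

For example, with p = 0.95 and n = 4 the whole frame arrives with a probability
of about 0.81, so nearly one frame in five would be undisplayable if every
packet were required.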

[0065] In the application of the inventive concealment technique, an S layer
frame is
broken into MTU-sized slices at the encoder for transmission. On the decoder
side
whatever slices are available from the S picture as received are used. Missing
slices
are compensated for using the concealment method (e.g., process 800), thus
reducing
the overall distortion.
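
The slice-level fallback may be sketched in Python as follows. This is a
simplified illustration only: slices are treated as opaque items, the slice
count is estimated from byte sizes alone, and the conceal callable stands in
for a concealment process such as process 800. All names are hypothetical.

```python
import math


def slices_needed(frame_bytes: int, mtu_payload: int) -> int:
    """Number of MTU-sized slices needed to carry one coded frame."""
    return math.ceil(frame_bytes / mtu_payload)


def rebuild_frame(received, conceal):
    """Return one slice per position, concealing the missing ones.

    received : list in slice order; None marks a slice that did not arrive
    conceal  : callable mapping a slice index to a substitute slice
    """
    return [s if s is not None else conceal(i) for i, s in enumerate(received)]
```

Slices that arrive are used as received, and only the missing positions are
filled in, so the loss of one packet no longer discards the whole frame.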

[0066] In a laboratory experiment, this concealment technique provided similar
or
better performance when compared with direct coding at the effective
communication
rate (total rate minus loss rate). For the experiment, it was assumed that
layers L0-L2
are reliably transmitted on the HRC, while layers S1 and S2 are transmitted
on the

LRC. Actual quality losses, in terms of Y-PSNR, were in the range of 0.2-0.3
dB per
5% of packet loss, clearly outperforming other known concealment techniques
such as
frame copy or motion-compensated frame copy. (See e.g., S. Bandyopadhyay, Z.
Wu,
P. Pandit, and J. Boyce, "Frame Loss Error Concealment for H.264/AVC," Doc.

JVT-P072, Poznan, Poland, July 2005, who report several dBs of loss with loss
rates
of even 5% in evaluations of single-layer AVC coding with an IPP...PI
structure, and
an I period of 1 sec.) The laboratory experiment results demonstrate that the

technique is effective for providing error resilience in scalable codecs.

[0067] FIG. 9 shows rate-distortion curves obtained using the standard
"foreman"
video test sequence with different QPs. For each QP, rate-distortion values
were
obtained by dropping different numbers of S1 and S2 frames, while applying the

inventive error concealment technique described above. As seen in FIG. 9, the
right-most points for each QP curve correspond to no loss, and then (in a
right-to-left direction), 50% of S2 dropped, 100% of S2 dropped, 100% of S2 and
50% of S1 dropped, and 100% of S1 and S2 dropped. The R-D curve of the codec,
which is obtained by connecting the zero-loss points for the different QPs, is
overlaid. It will be seen from FIG. 9 that various curves, particularly for QPs
smaller than 30, are close to the R-D curve but in some cases are higher. It is
expected that the difference will be eliminated with further optimization of the
basic codec used.

[0068] The laboratory experiment results show that Y-PSNR is similar to the
Y-PSNR of the same encoder operating at the effective transmission rate. This
suggests
that the concealment technique can be advantageously used for rate control
purposes.
The effective transmission rate is defined as the transmission rate minus the
loss rate,
i.e., the rate calculated based on the packets that actually arrive at the
destination.
The bit rate corresponding to S1 and S2 frames is typically 30% of the total
for the

specific coding structure, which implies that any bit rate between 70% and
100% may
be achieved by eliminating a selected number of S1 and S2 frames for rate
control. Bit rates between 70% and 100% may be achieved by selecting the number
of S2 or S1 and S2 frames that are dropped in a given time period.

[0069] An even wider range for rate control may be obtained for picture coding

structure using LR/SR pictures, which are described, for example, in
International
patent application No. PCT/US06/061815. With such picture structures, it is
possible not to transmit the S0 pictures in the HRC, but to include only the
lower temporal resolution SR pictures in the HRC. This feature enables a wider
range for rate control.

[0070] Table I summarizes the rate percentage of the different frame types for
a typical video sequence (e.g., spatial scalability, QCIF-CIF resolution,
three-layer threading, 380 Kbps).

Table I

Frame Type    Rate (%)    Cumulative Rate (%)
L0            15          15
L1             7          22
L2             4          26
S0            46          72
S1            18          90
S2            10          100

[0071] By combining different frame types, the concealment technique can
achieve practically any desired rate. For example, when all of the L0-L2 and S0
pictures are included, and only 1 out of every 10 S1 pictures is retained (all
S2 pictures being dropped), a rate which is approximately 72+1.8=73.8% of the
total can be achieved. Alternative techniques known in the
art
such as Fine Granularity Scalability (FGS) attempt to achieve similar rate
flexibility,
but with very poor rate-distortion performance and significant computational



overhead. The concealment technique of the present invention offers the rate
scalability associated with FGS, but without the coding efficiency penalty
associated
with such techniques.
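
The rate arithmetic based on Table I may be sketched in Python as follows. The
sketch assumes that frames of a given type are of equal size on average, so
that keeping a fraction of them keeps the same fraction of that type's rate.
The function names are hypothetical.

```python
# Rate share of each frame type, in percent of the total (Table I).
RATE_PERCENT = {"L0": 15, "L1": 7, "L2": 4, "S0": 46, "S1": 18, "S2": 10}


def transmitted_rate(keep_fraction):
    """Percent of the full bit rate sent for the given per-type keep fractions.

    Frame types missing from keep_fraction are kept in full (fraction 1.0).
    """
    return sum(rate * keep_fraction.get(kind, 1.0)
               for kind, rate in RATE_PERCENT.items())


def s1_fraction_for(target_percent):
    """Fraction of S1 frames to keep, with all S2 frames dropped, for a target rate."""
    base = transmitted_rate({"S1": 0.0, "S2": 0.0})   # L0-L2 and S0 only
    fraction = (target_percent - base) / RATE_PERCENT["S1"]
    return min(1.0, max(0.0, fraction))               # clamp to a valid fraction
```

Keeping one S1 frame in ten and no S2 frames, i.e.
transmitted_rate({"S1": 0.1, "S2": 0.0}), gives approximately 73.8% of the
total, matching the 72+1.8 figure in the text.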

[0072] The intentional elimination of S1 and S2 frames from the video
transmission may be performed either at the encoder or at an available
intermediate gateway (e.g., an SVCS/CSVCS).

[0073] Further, it will be understood that the application of the concealment
technique
of the present invention for achieving rate control has been described herein
with the
loss of S1 frames in a two-layer structure, only for purposes of
illustration. In

practice, the technique is not limited to a particular threading structure,
but can be
applied to any spatially-scalable codec that uses a pyramidal temporal
structure (e.g.,
structures including more than two quality or spatial levels, different
temporal
structures, etc.).

[0074] A further use of the inventive concealment technique is to display the
video
signal at a resolution in between the two coded resolutions. For example,
assume a
video signal is coded at QCIF and CIF resolution using a spatially scalable
codec. If a
user wants to display the output in 1/2 CIF resolution (HCIF), a traditional
decoder
would follow one of two approaches: 1) decode the QCIF signal and upsample to
HCIF, or 2) decode the CIF signal and downsample to HCIF. In the first case,
the

HCIF picture quality will not be good, but the bitrate used will be low. In
the second
case, the quality can be very good, but the bitrate used will also be nearly
double that
required in the first approach. These disadvantages of traditional decoders
are
overcome by the inventive error concealment techniques.

[0075] For example, intentionally discarding all S1 and S2 frames can result in
a significant bandwidth reduction with very little drop in quality by applying
the S1/S2
error concealment technique described herein. By downsampling the resulting
decoded CIF signal, a very good rendition of the HCIF signal is obtained. It
is noted
that conventional simulcast techniques in which separate single-layer streams
are
transmitted at QCIF and CIF resolutions, do not allow such derivation of the
signal at

an intermediate resolution at a usable bit rate unless the frame rate is also
dropped.
The inventive concealment technique exploits spatially scalable coding for
deriving
intermediate resolution signals at a usable bit rate.

[0076] In practice, application of the inventive concealment technique for
deriving an
intermediate resolution requires operation of the enhancement layer decoding
loop for
S0 at full resolution. The decoding involves both the generation of the
decoded

prediction error, as well as the application of motion compensation at full
resolution.
In order to reduce the computational requirements only the decoded prediction
error
may be generated in full resolution, followed by downsampling to the target
resolution (e.g., HCIF). The reduced resolution signal may then be motion

compensated using appropriately scaled motion vectors and residual
information.
This technique can also be used on any portion of the `S' layer that is
retained for
transmission to the receiver. As there will be drift introduced in the
enhancement
layer decoding loop, a mechanism to periodically eliminate drift may be
required. In
addition to standard techniques such as I frames, the periodic use of the
INTRA_BL

mode of spatial scalability for each enhancement layer macroblock may be
employed,
where only information from the base layer is used for prediction. (See e.g.,
PCT/US06/28365). Since no temporal information is used, the drift for that
particular
macroblock is eliminated. If SR pictures are used, drift can also be
eliminated by
decoding all SR pictures at full resolution. Since SR pictures are far apart,
there can

still be considerable gain in computational complexity. In some cases, the
technique
for deriving an intermediate resolution signal may be modified by operating
the
enhancement layer decoder loop in reduced resolution. In cases where CPU
resources are not a limiting factor and faster switching than the SR separation
is required or desired, the same (i.e., operating the decoder loop at full
resolution) can be applied to higher temporal levels (e.g., S0) as needed.

[0077] Another exemplary application of the inventive concealment technique is
to a video conferencing system in which spatial or quality levels are achieved
via simulcast. In this case, concealment is performed using base layer
information as described above. The enhancement layer's drift can be
eliminated via any one of a) threading, b) standard SVC temporal scalability,
c) periodic I frames, and d) periodic intra macroblocks.

[0078] An SVCS/CSVCS that is utilizing simulcast to provide spatial
scalability, and is only transmitting the higher resolution information for a
particular destination for a particular stream (for example, if it assumes no
or almost no errors), may replace a missing frame of the high resolution with
a low resolution one, anticipating such an error concealment mechanism at the
decoder, and relying on temporal scalability to eliminate drift as discussed
above. It will be understood that the concealment process described can be
readily adapted to create an effective rate control on such a system.

[0079] In the event that the SVCS, CSVCS or the encoder responsible for
discarding the higher resolution frames or detecting their loss cannot assume
that the decoder receiving such frames is equipped with the concealment method
described herein, such entity may create a replacement high resolution frame
that will achieve a similar functionality by one of the following methods:

a) for error resilience in spatial scalability coding, create a synthetic
frame, based on parsing of the lower resolution frame, that will include only
the appropriate signaling to use upsampled base layer information without any
additional residuals or motion vector refinement;

b) for rate control in a system using spatial scalability, the combination
of the method described in (a) with the addition that some macroblocks (MBs)
containing significant information from the original high resolution frame are
retained;

c) for an error resilient system using simulcast for spatial scalability,
create a replacement high resolution frame that will include synthetic MBs
that will include upsampled motion vectors and residual information;

d) for rate control in a system using simulcast for spatial scalability, the
method described in (c) with the addition that some MBs containing significant
information from the original high resolution frame are retained.

[0080] In the cases a) and b) above, the signaling to use only an upsampled
version of the base layer picture can be performed either in-band through the
coded video bitstream or through out-of-band information that is sent from the
encoder or SVCS/CSVCS to the receiving terminal. For the in-band signaling
case, specific syntax elements in the coded video bitstream must be present in
order to instruct the decoder to use only the base layer information for some
or all enhancement layer MBs. In an exemplary codec of the present invention,
which is based on the JD7 version of the SVC specification (see T. Wiegand, G.
Sullivan, J. Reichel, H. Schwarz, M. Wien, eds., "Joint Draft 7, Rev. 2:
Scalable Video Coding," Joint Video Team, Doc. JVT-T201, Klagenfurt, July
2006, incorporated herein by reference in its entirety) and described in
provisional U.S. patent application Serial No. 60/862,510, a set of flags can
be introduced at the slice header, to indicate that when a macroblock is not
coded, specific prediction modes that utilize the base layer data are to be
used. By skipping all enhancement layer macroblocks, the encoder or SVCS/CSVCS
will practically eliminate the S1 or S2 frames, but replace them with very
small data packets that only contain the few bytes necessary to indicate the
default prediction modes and the fact that all macroblocks are skipped.
Similarly, for performing rate control, the encoder or SVCS/CSVCS may
selectively eliminate some information from enhancement layer MBs. For
example, the encoder or SVCS/CSVCS may selectively maintain motion vector
refinements, but eliminate residual prediction, or keep residual prediction,
but eliminate motion vector refinements.

[0081] With continued reference to the SVC JD7 specification, there are
several flags in the MB layer (in scalable extension) that are used for
predicting information from the base layer, if the base layer exists. They are
base_mode_flag, motion_prediction_flag and residual_prediction_flag.
Similarly, there already exists a flag in the slice header,
adaptive_prediction_flag, which is used to indicate the presence of
base_mode_flag in the MB layer. To trigger the concealment operation, one
needs to set base_mode_flag to 1 for every MB, which can be done using the
already existing adaptive_prediction_flag. By setting the slice header flag
adaptive_prediction_flag to 0, and taking into account that the default value
for the residual_prediction_flag in inter MBs is 1, we can indicate that all
MBs in a slice are skipped (using mb_skip_run or mb_skip_flag signaling) and
thus direct the decoder to essentially perform the concealment operation
disclosed herein.
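
The flag combination described above can be illustrated with a small model. A
minimal sketch follows, assuming a simplified in-memory representation of a
slice; the SliceHeader and Slice classes and both helper functions are
hypothetical and do not reproduce the actual SVC bitstream syntax or any real
codec API. Only the flag name adaptive_prediction_flag and the skip-run idea
come from the text.

```python
from dataclasses import dataclass

@dataclass
class SliceHeader:
    # Hypothetical model holding only the slice-header flag named in the text.
    adaptive_prediction_flag: int = 1

@dataclass
class Slice:
    header: SliceHeader
    num_mbs: int          # macroblocks covered by this slice
    mb_skip_run: int = 0  # number of consecutive skipped macroblocks

def make_concealment_slice(num_mbs: int) -> Slice:
    """Build a placeholder slice: adaptive prediction off, every MB skipped."""
    return Slice(SliceHeader(adaptive_prediction_flag=0), num_mbs,
                 mb_skip_run=num_mbs)

def triggers_concealment(s: Slice) -> bool:
    """True when the slice carries no MB data and relies on default modes."""
    return s.header.adaptive_prediction_flag == 0 and s.mb_skip_run == s.num_mbs

# A CIF picture (352x288) holds 22 x 18 = 396 macroblocks of 16x16 pixels.
placeholder = make_concealment_slice(396)
```

Such a placeholder corresponds to the "very small data packets" that stand in
for an eliminated S1 or S2 frame.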

[0082] It is recognized that a potential drawback of the concealment technique
is that the bitrate of the coded stream without the S1 and S2 frames may be
very uneven or "bursty," since the S0 frames are typically quite large (e.g.,
as high as 45% of the total bandwidth). To mitigate this behavior, in a
modification (hereinafter "progressive concealment") the S0 packets may be
transmitted by splitting them into smaller packets and/or slices and spreading
their transmission over the time interval between successive S0 pictures. The
entire S0 picture will not be available for the first S2 picture, but
information that has been received by the first S2 picture (i.e., portions of
S0 and the entire L0 and L2) can be used for concealment purposes. In this
manner the decoder can also recover an appropriate reference frame in time to
display the L1/S1 picture, which would further help in creating decoded
versions of both the L1/S1 picture and the second L2/S2. Otherwise, as they
are further apart from the L0 frame, they may show more concealment artifacts
due to motion.
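
The pacing idea behind progressive concealment can be sketched as follows.
This is an illustration under stated assumptions, not the disclosed
implementation: the function name spread_s0, the fixed maximum chunk size, and
the evenly spaced send offsets are all hypothetical choices. The text only
requires that the S0 data be split and spread over the interval between
successive S0 pictures.

```python
def spread_s0(payload: bytes, max_chunk: int, interval_ms: float):
    """Split one S0 payload into chunks with evenly spaced send offsets.

    Returns a list of (offset_ms, chunk) pairs. Offsets start at 0 and stay
    strictly below interval_ms, so all chunks are sent before the next S0.
    """
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    chunks = [payload[i:i + max_chunk]
              for i in range(0, len(payload), max_chunk)]
    if not chunks:
        return []
    step = interval_ms / len(chunks)
    return [(k * step, chunk) for k, chunk in enumerate(chunks)]

# Hypothetical numbers: a 1000-byte S0 picture, 300-byte chunks, and 400 ms
# between successive S0 pictures.
schedule = spread_s0(bytes(1000), 300, 400.0)
```

With these numbers the payload is sent as four chunks at 0, 100, 200 and 300
ms instead of as a single burst.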

[0083] Another alternative solution to mitigate the effects of bursty S0
transmissions is to smooth out the variable bit-rate (VBR) traffic by
additional buffering at the cost of increased end-to-end delay. It is noted
that in multipoint conferencing applications, there is inherent statistical
multiplexing at the server. Therefore, the VBR behavior of the traffic
originating from the server will be naturally smoothed.

[0084] International patent application No. PCT/US06/061815 describes the
problems of error resilience and random access and provides solutions
appropriate for different application scenarios.

[0085] The progressive concealment technique described above also may be used
for video switching. An exemplary switching application is to a single-loop,
spatially scalable signal coded at QCIF and CIF resolutions with the
three-layer threading structure shown in FIG. 7. As described in International
patent application No. PCT/US06/061815, increased error resilience can be
achieved by ensuring reliable transmission of some of the L0 pictures. The L0
pictures that are reliably transmitted are referred to as LR pictures. The
same threading pattern can be extended to the S pictures, as shown in FIG. 10.
The temporal prediction paths for the S pictures are identical to those of the
L pictures. FIG. 10 shows an exemplary SR period of 1/3 (one out of every 3 S0
pictures is SR) for purposes of illustration. In practice, different periods
and different threading patterns can be used in accordance with the principles
of the present invention. Further, different paths in the S and L pictures
could also be used, but with a reduction in coding efficiency for the S
pictures. As with LR pictures, the SR pictures are assumed to be transmitted
reliably. As described in International patent application No.
PCT/US06/061815, this can be accomplished using a number of techniques, such
as DiffServ coding (where LR and SR are in the HRC), FEC or ARQ.
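
The SR period of 1/3 can be made concrete by labeling a run of enhancement
layer pictures. The sketch below assumes that each threading period contains
the pictures S0, S2, S1, S2 in that order; this ordering is inferred from the
references to a first and second S2 picture around the S1 picture, since the
figures are not reproduced here. The function name s_picture_labels and its
sr_every parameter are hypothetical.

```python
def s_picture_labels(num_pictures: int, sr_every: int = 3):
    """Label successive S pictures, marking every sr_every-th S0 as SR."""
    pattern = ["S0", "S2", "S1", "S2"]  # assumed order within one period
    labels = []
    s0_count = 0
    for i in range(num_pictures):
        label = pattern[i % len(pattern)]
        if label == "S0":
            if s0_count % sr_every == 0:
                label = "SR"  # this S0 is the reliably transmitted one
            s0_count += 1
        labels.append(label)
    return labels

labels = s_picture_labels(13)
```

With the default period, pictures 0 and 12 are labeled SR and pictures 4 and 8
remain ordinary S0 pictures.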

[0086] In the exemplary switching application of the progressive concealment
technique, an end-user at a terminal receiving a QCIF signal may desire to
switch to a CIF signal. In order to be able to start decoding the enhancement
layer CIF signal, the terminal must acquire at least one correct CIF reference
picture. A technique disclosed in International patent application No.
PCT/US06/061815 involves using periodic intra macroblocks, so that within a
period of time all macroblocks of the CIF picture will be intra coded. A
drawback is that it will take a significant amount of time to do so, if the
percentage of intra macroblocks is kept low (to minimize their impact on the
total bandwidth). In contrast, the switching application of the progressive
concealment technique exploits the reliable transmission of the SR pictures in
order to be able to start decoding the enhancement layer CIF signal.

[0087] The SR pictures can be transmitted to the receiver and be decoded even
if the receiver operates at a QCIF level. Since they are infrequent, their
overall effect on the bit rate can be minimal. When a user switches to the CIF
resolution, the decoder can utilize the most recent SR frame, and proceed as
if the intermediate S pictures until the first S picture received were lost.
If additional bit rate is available, the sender or server can also forward
cached copies of all intermediate S0 pictures to further aid the receiver in
constructing a reference picture as close to the starting frame of CIF
playback as possible. The rate-distortion performance of the S1/S2 concealment
technique will ensure that the impact on quality is minimized.
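
The switching step described above amounts to locating the most recent SR
picture and, optionally, the S0 pictures cached since then. A minimal sketch
follows, assuming the picture history is available as a list of labels indexed
by picture number; the function name switch_reference_plan and this
representation are hypothetical.

```python
def switch_reference_plan(labels, switch_index):
    """Find the pictures a receiver needs when switching at switch_index.

    Returns (sr_index, s0_indices): the index of the latest SR picture at or
    before switch_index, and the indices of the S0 pictures after it that a
    server could forward from its cache. Returns (None, []) if no SR exists.
    """
    if not 0 <= switch_index < len(labels):
        raise IndexError("switch_index out of range")
    sr_index = None
    for i in range(switch_index, -1, -1):  # search backwards for an SR
        if labels[i] == "SR":
            sr_index = i
            break
    if sr_index is None:
        return None, []
    s0_indices = [i for i in range(sr_index + 1, switch_index + 1)
                  if labels[i] == "S0"]
    return sr_index, s0_indices

# Hypothetical history with an SR picture every third S0.
history = ["SR", "S2", "S1", "S2", "S0", "S2", "S1", "S2",
           "S0", "S2", "S1", "S2", "SR"]
plan = switch_reference_plan(history, 10)
```

Here a switch at picture 10 starts from the SR picture at index 0, with the S0
pictures at indices 4 and 8 available as optional cached references; all other
intermediate S pictures are treated as lost and concealed.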

[0088] The inventive technique can also be used advantageously when the
end-user decodes at an intermediate output resolution, e.g., HCIF, and desires
to switch to CIF. An HCIF signal can be effectively derived from the L0-L2 and
a portion of the S0-S2 pictures (e.g., only S0), coupled with concealment for
dropped S frames. In this case, the decoder, which receives at least a portion
of the S0 pictures, can immediately switch to CIF resolution with a very small
PSNR penalty. Further, this penalty will be eliminated as soon as the next
S0/SR picture arrives. Thus, in this case, there is practically no overhead
and almost instantaneous switching can be achieved.

[0089] It is noted that although typical spatial coding structures employ 1:4
picture area ratios, some users are more comfortable with resolution changes
of 1:2. Therefore, in practice HCIF to CIF switching transitions are much more
likely than QCIF to CIF switching transitions, for example, in desktop
communication applications. A common scenario in video conferencing is that
the screen real estate is split into a large picture of the active speaker
surrounded by smaller pictures of the other participants, and where the active
speaker image automatically occupies the larger image. In the case where the
smaller images were created using the rate control methods described herein,
such a switch can be done frequently without any overhead. The switching of
participant images can be done frequently in an "active" layout without any
overhead. This feature is desirable for accommodating both conference
participants who prefer to view such an active layout, and other conference
participants who prefer a static view. Since the switching-by-concealment
method does not require any additional information to be sent by the encoder,
the choice of layout by one receiver does not impact the bandwidth received by
others.

[0090] The foregoing description refers to creating effective rendering for
intermediate resolutions and bit rates that span the range between
resolutions/bit rates directly provided by the encoder. It will be understood
that other methods that are known to decrease the bit rate (e.g., by
introducing drift) such as data partitioning or re-quantization can be
employed by the SVCS/CSVCS in conjunction with the inventive methods described
herein to provide a more detailed manipulation of the bit stream. For example,
assume that a resolution of 1/3 CIF is desired when only QCIF and CIF are
available, and that the SR, S0-S2 coding structure is used. Eliminating S1 and
S2 only may result in a bit rate that is too high to effectively be used as
1/3 CIF. Further, eliminating S0 may result in a bit rate that is too low
and/or be visually unacceptable due to motion-related artifacts. In such a
case, reducing the number of bits of the S0 frames using known methods such as
data partitioning or re-quantization may be useful in conjunction with the SR
transmission (either in VBR mode or using the progressive concealment) to
provide a more optimized result. It will be understood that these methods may
be applied to the S1 and S2 levels to achieve more fine-tuned rate control.
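
The 1/3 CIF example can be read as a simple decision procedure: drop S2, then
S1, and only then shrink S0 by re-quantization or data partitioning. The
sketch below illustrates that ordering under stated assumptions. The function
name plan_thinning, the per-level bit rates, and the target are all
hypothetical, and the small skip packets that replace dropped frames are
approximated as zero bits.

```python
def plan_thinning(rates_kbps, target_kbps):
    """Choose thinning actions so the total rate fits target_kbps.

    rates_kbps maps level names ('base', 'SR', 'S0', 'S1', 'S2') to bit
    rates. Returns (actions, kept), where kept holds the resulting rates.
    """
    kept = dict(rates_kbps)
    actions = []
    for level in ("S2", "S1"):  # drop the highest temporal levels first
        if sum(kept.values()) <= target_kbps:
            break
        kept[level] = 0  # dropped frames approximated as costing no bits
        actions.append("drop " + level)
    excess = sum(kept.values()) - target_kbps
    if excess > 0:
        # Remaining excess is removed from S0, e.g. by re-quantization.
        kept["S0"] = max(kept["S0"] - excess, 0)
        actions.append("requantize S0")
    return actions, kept

# Hypothetical per-level rates in kbit/s, totalling 500.
rates = {"base": 100, "SR": 20, "S0": 180, "S1": 100, "S2": 100}
actions, kept = plan_thinning(rates, 250)
```

For a 250 kbit/s target, dropping S1 and S2 leaves 300 kbit/s, so the
remaining 50 kbit/s is taken out of S0 while the base layer and SR pictures
are left intact.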

[0091] Although the preferred embodiments described herein use the H.264 SVC
draft standard, as is obvious to persons skilled in the art, the techniques
can be directly applied to any coding structure that allows multiple
spatial/quality and temporal levels.

[0092] It also will be understood that in accordance with the present
invention, the scalable codecs and concealment techniques described herein may
be implemented using any suitable combination of hardware and software. The
software (i.e., instructions) for implementing and operating the
aforementioned scalable codecs can be provided on computer-readable media,
which can include, without limitation, firmware, memory, storage devices,
microcontrollers, microprocessors, integrated circuits, ASICs, on-line
downloadable media, and other available media.


Administrative Status


Title Date
Forecasted Issue Date Unavailable
(86) PCT Filing Date 2007-03-05
(87) PCT Publication Date 2007-09-13
(85) National Entry 2008-09-03
Examination Requested 2009-02-24
Dead Application 2020-02-11

Abandonment History

Abandonment Date Reason Reinstatement Date
2012-12-21 R30(2) - Failure to Respond 2013-11-25
2019-02-11 FAILURE TO PAY FINAL FEE
2019-03-05 FAILURE TO PAY APPLICATION MAINTENANCE FEE

Payment History

Fee Type Anniversary Year Due Date Amount Paid Paid Date
Registration of a document - section 124 $100.00 2008-09-03
Application Fee $400.00 2008-09-03
Maintenance Fee - Application - New Act 2 2009-03-05 $100.00 2008-09-03
Registration of a document - section 124 $100.00 2008-12-01
Request for Examination $800.00 2009-02-24
Maintenance Fee - Application - New Act 3 2010-03-05 $100.00 2010-03-02
Maintenance Fee - Application - New Act 4 2011-03-07 $100.00 2011-02-28
Maintenance Fee - Application - New Act 5 2012-03-05 $200.00 2012-02-22
Maintenance Fee - Application - New Act 6 2013-03-05 $200.00 2013-02-21
Reinstatement - failure to respond to examiners report $200.00 2013-11-25
Maintenance Fee - Application - New Act 7 2014-03-05 $200.00 2014-02-21
Maintenance Fee - Application - New Act 8 2015-03-05 $200.00 2015-02-18
Maintenance Fee - Application - New Act 9 2016-03-07 $200.00 2016-02-17
Maintenance Fee - Application - New Act 10 2017-03-06 $250.00 2017-02-17
Maintenance Fee - Application - New Act 11 2018-03-05 $250.00 2018-02-19
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
VIDYO, INC.
Past Owners on Record
ELEFTHERIADIS, ALEXANDROS
HONG, DANNY
LAYERED MEDIA, INC.
SHAPIRO, OFER
WIEGAND, THOMAS
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents



Document Description   Date (yyyy-mm-dd)   Number of pages   Size of Image (KB)
Abstract 2008-09-03 2 84
Claims 2008-09-03 26 867
Drawings 2008-09-03 10 219
Description 2008-09-03 35 1,471
Representative Drawing 2009-01-21 1 10
Cover Page 2009-01-22 2 57
Description 2011-09-09 35 1,451
Claims 2012-04-27 27 901
Drawings 2013-11-25 10 163
Description 2013-11-25 35 1,453
Claims 2013-11-25 27 915
Claims 2015-09-28 11 402
Claims 2016-06-07 11 384
Amendment 2017-05-23 30 980
Description 2017-05-23 35 1,364
Claims 2017-05-23 12 358
Examiner Requisition 2017-10-13 4 204
Correspondence 2009-01-09 1 31
Amendment 2018-04-06 27 952
Claims 2018-04-06 12 390
PCT 2008-09-03 4 166
Assignment 2008-09-03 4 111
Assignment 2008-12-01 8 311
Prosecution-Amendment 2009-02-24 1 38
Fees 2010-03-02 1 201
Prosecution-Amendment 2011-03-09 2 82
Prosecution-Amendment 2011-09-09 5 141
Prosecution-Amendment 2011-10-27 2 85
Prosecution-Amendment 2012-04-27 31 1,043
Prosecution-Amendment 2012-06-21 2 72
Prosecution-Amendment 2013-11-25 1 35
Prosecution-Amendment 2013-11-25 73 2,309
Prosecution-Amendment 2013-12-11 1 37
Prosecution-Amendment 2014-01-15 3 101
Fees 2014-02-21 1 33
Examiner Requisition 2015-12-07 3 199
Prosecution-Amendment 2015-03-26 4 234
Amendment 2015-09-28 41 1,631
Amendment 2016-01-21 1 45
Amendment 2016-06-07 27 971
Examiner Requisition 2016-11-22 3 186