Patent 2259374 Summary

(12) Patent Application: (11) CA 2259374
(54) English Title: SPEECH SYNTHESIS SYSTEM
(54) French Title: SYSTEME DE SYNTHESE VOCALE
Status: Deemed Abandoned and Beyond the Period of Reinstatement - Pending Response to Notice of Disregarded Communication
Bibliographic Data
(51) International Patent Classification (IPC):
(72) Inventors :
  • XYDEAS, COSTAS (United Kingdom)
(73) Owners :
  • THE VICTORIA UNIVERSITY OF MANCHESTER
(71) Applicants :
  • THE VICTORIA UNIVERSITY OF MANCHESTER (United Kingdom)
(74) Agent: MARKS & CLERK
(74) Associate agent:
(45) Issued:
(86) PCT Filing Date: 1997-07-07
(87) Open to Public Inspection: 1998-01-15
Examination requested: 2002-06-28
Availability of licence: N/A
Dedicated to the Public: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): Yes
(86) PCT Filing Number: PCT/GB1997/001831
(87) International Publication Number: WO1998/001848
(85) National Entry: 1998-12-29

(30) Application Priority Data:
Application No. Country/Territory Date
021,815 (United States of America) 1996-07-16
9614209.6 (United Kingdom) 1996-07-05

Abstracts

English Abstract


A speech synthesis system in which a speech signal is divided into a series of
frames, and each frame is converted into a coded signal including a
voiced/unvoiced classification and a pitch estimate, wherein a low pass
filtered speech segment centred about a reference sample is defined in each
frame, a correlation value is calculated for each of a series of candidate
pitch estimates as the maximum of multiple crosscorrelation values obtained
from variable length speech segments centred about the reference sample, the
correlation values are used to form a correlation function defining peaks, and
the locations of the peaks are determined and used to define a pitch estimate.


French Abstract

Système de synthèse vocale dans lequel un signal de voix est divisé en une série de trames, et chaque trame est convertie en un signal codé comprenant une classification voisée/non voisée et une estimation de hauteur tonale, dans lequel un segment de voix filtré avec un filtre passe-bas centré sur un échantillon de référence est défini dans chaque trame, une valeur de corrélation est calculée pour chaque proposition d'une série de propositions de hauteur tonale, cette valeur représentant le maximum d'une multiplicité de valeurs d'intercorrélation dérivées de segments de voix de longueur variable centrés sur l'échantillon de référence. Les valeurs de corrélation sont utilisées pour former une fonction de corrélation pour définir des pics, les localisations de ces pics étant déterminées et utilisées pour établir une estimation de hauteur tonale.

Claims

Note: Claims are shown in the official language in which they were submitted.


CLAIMS
1. A speech synthesis system in which a speech signal is divided into a series
of frames, and each frame is converted into a coded signal including a
voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered
speech segment centred about a reference sample is defined in each frame, a
correlation value is calculated for each of a series of candidate pitch estimates as
the maximum of multiple crosscorrelation values obtained from variable length
speech segments centred about the reference sample, the correlation values are
used to form a correlation function defining peaks, and the locations of the
peaks are determined and used to define a pitch estimate.
2. A system according to claim 1, wherein the pitch estimate is defined using
an iterative process.
3. A system according to claim 1 or 2, wherein a single reference sample may
be used, centred with respect to the respective frame.
4. A system according to claim 1 or 2, wherein multiple pitch estimates are
derived for each frame using different reference samples, the multiple pitch
estimates being combined to define a combined pitch estimate for the frame.

5. A system according to any preceding claim, wherein the pitch estimate is
modified by reference to a voiced/unvoiced status and/or pitch estimates of
adjacent frames to define a final pitch estimate.
6. A system according to any preceding claim, wherein the correlation
function is clipped using a threshold value, remaining peaks being rejected if
they are adjacent to larger peaks.
7. A system according to claim 6, wherein peaks are selected which are
larger than either adjacent peak and peaks are rejected if they are smaller than a
following peak by more than a predetermined factor.
8. A system according to any preceding claim, wherein the pitch estimation
procedure is based on a least squares error algorithm.
9. A system according to claim 8, wherein the pitch estimation algorithm
defines the pitch value as a number whose multiples best fit the correlation
function peak locations.
10. A system according to any preceding claim, wherein possible pitch values
are limited to integral numbers which are not consecutive, the increment
between two successive numbers being proportional to a constant multiplied by
the lower of those two numbers.
11. A speech synthesis system in which a speech signal is divided into a series
of frames, and each frame is converted into a coded signal including pitch
segment magnitude spectral information, a voiced/unvoiced classification, and a
mixed voiced classification which classifies harmonics in the magnitude spectrum
of voiced frames as strongly voiced or weakly voiced, wherein a series of samples
centred on the middle of the frame are windowed to form a data array which is
Fourier transformed to produce a magnitude spectrum, a threshold value is
calculated and used to clip the magnitude spectrum, the clipped data is searched
to define peaks, the locations of peaks are determined, constraints are applied to
define dominant peaks, and harmonics not associated with a dominant peak are
classified as weakly voiced.
12. A system according to claim 11, wherein peaks are located using a second
order polynomial.
13. A system according to claim 11 or 12, wherein the samples are Hamming
windowed.

14. A system according to claim 11, 12 or 13, wherein the threshold value is
calculated by identifying the maximum and minimum magnitude spectrum
values and defining the threshold as a constant multiplied by the difference
between the maximum and minimum values.
15. A system according to any one of claims 11 to 14, wherein peaks are
defined as those values which are greater than the two adjacent values, a peak
being rejected from consideration if neighbouring peaks are of a similar
magnitude or if there are spectral magnitudes in the same range of greater
magnitude.
16. A system according to any one of claims 11 to 15, wherein a harmonic is
considered as not being associated with a dominant peak if the difference
between two adjacent peaks is greater than a predetermined threshold value.
17. A system according to any one of claims 11 to 16, wherein the spectrum is
divided into bands of fixed width and a strongly/weakly voiced classification is
assigned for each band.
18. A system according to any one of claims 11 to 17, wherein the frequency
range is divided into two or more bands of variable width, adjacent bands being

separated at a frequency selected by reference to the strongly/weakly voiced
classification of harmonics.
19. A system according to claim 17 or 18, wherein the lowest frequency band
is regarded as strongly voiced, whereas the highest frequency band is regarded
as weakly voiced.
20. A system according to claim 19, wherein, in the event that a current frame is
voiced, and the following frame is unvoiced, further bands within the current
frame will be automatically classified as weakly voiced.
21. A system according to claim 19 or 20, wherein the strongly/weakly voiced
classification is determined using a majority decision rule on the strongly/weakly
voiced classification of those harmonics which fall within the band in question.
22. A system according to claim 21, wherein, if there is no majority, alternate
bands are alternately assigned strongly voiced and weakly voiced classifications.
23. A speech synthesis system in which a speech signal is divided into a series
of frames, each frame is defined as voiced or unvoiced, each frame is converted
into a coded signal including a pitch period value, a frame voiced/unvoiced
classification and, for each voiced frame, a mixed voiced spectral band

classification which classifies harmonics within spectral bands as either strongly
or weakly voiced, and the speech signal is reconstructed by generating an
excitation signal in respect of each frame and applying the excitation signal to a
filter, wherein for each weakly voiced spectral band, an excitation signal is
generated which includes a random component in the form of a function which is
dependent upon the respective pitch period value.
24. A system according to claim 23, wherein the spectrum is divided into
bands and a strongly/weakly voiced classification is assigned to each band.
25. A system according to claim 23 or 24, wherein the random component is
introduced by reducing the amplitude of harmonic oscillators assigned the
weakly voiced classification, disturbing the oscillator frequencies such that the
frequency is no longer a multiple of the fundamental frequency, and then adding
further random signals.
26. A system according to claim 25, wherein the phase of the oscillators is
randomised.
27. A speech synthesis system in which a speech signal is divided into a series
of frames, and each voiced frame is converted into a coded signal including a
pitch period value, LPC coefficients and pitch segment spectral magnitude

information, wherein the spectral magnitude information is quantized by
sampling the LPC short term magnitude spectrum at harmonic frequencies, the
locations of the largest spectral samples are determined to identify which of the
magnitudes are relatively more important for accurate quantization, and the
magnitudes so identified are selected and vector quantized.
28. A system according to claim 27, wherein a pitch segment of Pn LPC
residual samples is obtained, where Pn is the pitch period value of the nth frame,
the pitch segment is DFT transformed, the mean value of the resultant spectral
magnitudes is calculated, the mean value is quantized and used as a
normalisation factor for the selected magnitudes, and the resulting normalised
amplitudes are quantized.
29. A system according to claim 27, wherein the RMS value of the pitch
segment is calculated, the RMS value is quantized and used as a normalisation
factor for the selected magnitudes, and the resulting normalised amplitudes are
quantized.
30. A system according to any one of claims 27 to 29, wherein, at the receiver,
the selected magnitudes are recovered, and each of the other magnitude values is
reproduced as a constant value.

31. A speech synthesis system in which a variable size input vector of
coefficients to be transmitted to a receiver for the reconstruction of a speech
signal is vector quantized using a codebook defined by vectors of fixed size, the
codebook vectors of fixed size are obtained from variable sized training vectors
and an interpolation technique which is an integral part of the codebook
generation process, codebook vectors are compared to the variable sized input
vector using the interpolation process, and an index associated with the codebook
entry with the smallest difference from the comparison is transmitted, the index
being used to address a further codebook at the receiver and thereby derive an
associated fixed size codebook vector, and the interpolation process being used to
recover from the derived fixed sized codebook vector an approximation of the
variable sized input vector.
32. A system according to claim 31, wherein the interpolation process is
linear, and for an input vector of given dimension, the interpolation process is
applied to produce from the codebook vectors a set of vectors of that given
dimension, a distortion measure is then derived to compare the interpolated set
of vectors and the input vector, and the codebook vector is selected which yields
the minimum distortion.

33. A system according to claim 32, wherein the dimension of the vectors is
reduced by taking into account only the harmonic amplitudes within an input
bandwidth range.
34. A system according to claim 33, wherein the remaining amplitudes are set
to a constant value.
35. A system according to claim 34, wherein the constant value is equal to the
mean value of the quantized amplitudes.
36. A system according to any one of claims 31 to 35, wherein redundancy
between amplitude vectors obtained from adjacent residual frames is removed
by means of backward prediction.
37. A system according to claim 36, wherein the backward prediction is
performed on a harmonic basis such that the amplitude value of each harmonic
of one frame is predicted from the amplitude value of the same harmonic in the
previous frame or frames.
38. A speech synthesis system in which a speech signal is divided into a series
of frames, each frame is converted into a coded signal including an estimated
pitch period, an estimate of the energy of a speech segment the duration of which

is a function of the estimated pitch period, and LPC filter coefficients defining an
LPC spectral envelope, and a speech signal of related power to the power of the
input speech signal is reconstructed by generating an excitation signal using
spectral amplitudes which are defined from a modified LPC spectral envelope
sampled at harmonic frequencies defined by the pitch period.
39. A system according to claim 38, wherein the magnitude values are
obtained by spectrally sampling a modified LPC synthesis filter characteristic at
the harmonic locations related to the pitch period.
40. A system according to claim 39, wherein the modified LPC synthesis filter
has reduced feed back gain and a frequency response which consists of equalised
resonant peaks, the locations of which are close to the LPC synthesis resonant
locations.
41. A system according to claim 40, wherein the value of the feed back gain is
controlled by the performance of the LPC model such that it is related to the
normalised LPC prediction error.
42. A system according to any one of claims 38 to 41, wherein the energy of
the reproduced speech signal is equal to the energy of the original speech
waveform.

43. A speech synthesis system in which a speech signal is divided into a series
of frames, each frame is converted into a coded signal including LPC filter
coefficients and at least one parameter associated with a pitch segment
magnitude, and the speech signal is reconstructed by generating two excitation
signals in respect of each frame, each pair of excitation signals comprising a first
excitation signal generated on the basis of the pitch segment magnitude
parameter or parameters of one frame and a second excitation signal generated
on the basis of the pitch segment magnitude parameter or parameters of a
second frame which follows and is adjacent to the said one frame, applying the
first excitation signal to a first LPC filter the characteristics of which are
determined by the LPC filter coefficients of the said one frame and applying the
second excitation signal to a second LPC filter the characteristics of which are
determined by the LPC filter coefficients of the said second frame, and weighting
and combining the outputs of the first and second LPC filters to produce one
frame of a synthesised speech signal.
44. A system according to claim 43, wherein the first and second excitation
signals include the same phase function and different phase contributions from
the two LPC filters.

45. A system according to claim 44, wherein the outputs of the first and
second LPC filters are weighted by half a window function such that the
magnitude of the output of the first filter is decreasing with time and the
magnitude of the output of the second filter is increasing with time.
46. A speech coding system which operates on a frame by frame basis, and in
which information is transmitted which represents each frame as either voiced or
unvoiced and, for each voiced frame, represents that frame by a pitch period
value, quantized magnitude spectral information, and LPC filter coefficients, the
received pitch period value and magnitude spectral information being used to
generate residual signals at the receiver which are applied to LPC speech
synthesis filters the characteristics of which are determined by the transmitted
filter coefficients, wherein each residual signal is synthesised according to
a sinusoidal mixed excitation synthesis process, and a recovered speech signal is
derived from the residual signals.
47. A speech synthesis system substantially as hereinbefore described with
reference to the accompanying drawings.

Description

Note: Descriptions are shown in the official language in which they were submitted.


SPEECH SYNTHESIS SYSTEM
The present invention relates to speech synthesis systems, and in particular to speech coding and synthesis systems which can be used in speech communication systems operating at low bit rates.
Speech can be represented as a waveform the detailed structure of which represents the characteristics of the vocal tract and vocal excitation of the person producing the speech. If a speech communication system is to be capable of providing an adequate perceived quality, the transmitted information must be capable of representing that detailed structure. Most of the power in voiced speech is at relatively low frequencies, for example below 2kHz. Accordingly good quality speech synthesis can be achieved on the basis of speech waveforms that have been low pass filtered to reject higher frequency components. The perceived speech quality is however adversely affected if the frequency range is restricted much below 4kHz.
Many models have been suggested for defining the characteristics of speech. The known models rely upon dividing a speech signal into blocks or frames and deriving parameters to represent the characteristics of the speech within each frame. Those parameters are then quantized and transmitted to a receiver. At the receiver the quantization process is reversed to recover the parameters, and a speech signal is then synthesised on the basis of the recovered parameters.

The common objective of the designers of the known models is to minimise the volume of data which must be transmitted whilst maximising the perceived quality of the speech that can be synthesised from the transmitted data. In some of the models a distinction is made between whether or not a particular frame is "voiced" or "unvoiced". In the case of voiced speech, speech is produced by glottal excitation and as a result has a quasi-periodic structure. Unvoiced speech is produced by turbulent air flow at a constriction and does not have the "periodic" spectral structure characteristic of voiced speech. Most models seek to take advantage of the fact that voiced speech signals evolve relatively slowly in the context of frames the duration of which is typically 10 to 30msecs. Most models also rely upon quantization schemes intended to minimise the amount of information which must be transmitted without significant loss of perceived quality. As a result of the work done to date it is now possible to produce speech synthesis systems capable of operating at bit rates of only a few thousand bits per second.
One model which has been developed is known as "sinusoidal coding" (R.J. McAulay and T.F. Quatieri, "Low Rate Speech Coding Based on Sinusoidal Coding", Advances in Speech Signal Processing, Editors S. Furui and M.M. Sondhi, Chapter 6, pp. 165-208, Marcel Dekker, New York, 1992). This approach relies upon an FFT analysis of each input frame to produce a magnitude spectrum, estimating the pitch period of the input frame from that spectrum, and defining the amplitudes at the pitch related harmonics, the
harmonics being multiples of the fundamental frequency of the frame. An error measure is calculated in the time domain representing the difference between harmonic and aharmonic speech spectra and that error measure is used to define the degree of voicing of the input frame in terms of a frequency value. Thus the parameters used to represent a frame are the pitch period, the magnitude and phase values for each harmonic, and the frequency value. Proposals have been made to operate this system such that phase information is predicted in a coherent way across successive frames.
In another system known as "multiband excitation coding" (D.W. Griffin and J.S. Lim, "Multiband Excitation Vocoder", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 36, pp. 1223-1235, 1988 and Digital Voice Systems Inc, "INMARSAT M Voice Codec, Version 3.0", Voice Coding System Description, Module 1, Appendix 1, August 1991) the amplitude and phase functions are determined in a different way from that employed in sinusoidal coding. The emphasis in this system is placed on dividing a spectrum into bands, for example up to twelve bands, and evaluating the voiced/unvoiced nature of each of these bands. Bands that are classified as unvoiced are synthesised using random signals. Where the difference between the pitch estimates of successive frames is relatively small, linear interpolation is used to define the required amplitudes. The phase function is also defined using linear frequency interpolation but in addition includes a constant displacement which is a random variable and which depends on the number of unvoiced bands present in the
short term spectrum of the input signal. The system works in a way to preserve phase continuity between successive frames. When the pitch estimates of successive frames are significantly different, a weighted summation of signals produced from amplitudes and phases derived for successive frames is formed to produce the synthesised signal.
Thus the common ground between the sinusoidal and multiband systems referred to above is that both schemes directly model the input speech signal which is DFT analysed, and both systems are at least partially based on the same fundamental relationship for representing speech to be synthesised. The systems differ however in terms of the way in which amplitudes and phase are estimated and quantized, the way in which different interpolation methods are used to define the necessary phase relationships, and the way in which "randomness" is introduced in the recovered speech.
Various versions of the multiband excitation coding system have been proposed, for example an enhanced multiband excitation speech coder (A. Das and A. Gersho, "Variable-Dimension Spectral Coding of Speech at 2400 bps and below with Phonetic Classification", IEEE Proc. ICASSP-95, pp. 492-495, May 1995) in which input frames are classified into four types, that is noise, unvoiced, fully voiced and mixed voiced, and a variable dimension vector quantization process for spectral magnitude is introduced, the bi-harmonic spectral modelling system (C. Garcia-Mateo, J.L. Alba-Castro and Eduardo R. Banga, "Speech Coding Using Bi-Harmonic Spectral Modelling", Proc. EUSIPCO-94,
Edinburgh, Vol. 2, pp 391-394, September 1994) in which the short term magnitude spectrum is divided into two bands and a separate pitch frequency is calculated for each band, the spectral excitation coding system (V. Cuperman, P. Lupini and B. Bhattacharya, "Spectral Excitation Coding of Speech at 2.4 kb/s", IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) which applies sinusoidal based coding in the linear predictive coding (LPC) residual domain where the synthesised residual signal is the summation of pitch harmonic oscillators with appropriate amplitude and phase functions and amplitudes are quantized using a non-square transformation, the band-widened harmonic vocoder (G. Yang, G. Zanellato and H. Leich, "Band Widened Harmonic Vocoder at 2 to 4 kbps", IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) in which randomness in the signal is introduced by adding jitter to the amplitude information on a per band basis, pitch synchronous multiband coding (H. Yang, S.N. Koh and P. Sivaprakasapillai, "Pitch Synchronous Multi-Band (PSMB) Speech Coding", IEEE Proc. ICASSP-95, pp. 516-519, Detroit, May 1995) in which a CELP (code-excited linear prediction) based coding scheme is used to encode speech period segments, multiband LPC coding (S. Yeldener, M. Kondoz and G. Evans, "High Quality Multiband LPC Coding of Speech at 2.4 kbits/s", Electronics Letters, pp. 1287-1289, Vol. 27, No 14, 4th July 1991) in which a single amplitude value is allocated to each frame to in effect specify a "flat" residual spectrum, and harmonic and noise coding (M. Nishiguchi and J. Matsumoto, "Harmonic and Noise Coding of LPC Residuals with Classified Vector
Quantization", IEEE Proc. ICASSP-95, pp. 484-487, Detroit, May 1995) with classified vector quantization which operates in the LPC residual domain, an input signal being classified as voiced or unvoiced and being full band modelled.
A further type of coding system exists, that is the prototype interpolation coding system. This relies upon the use of pitch period segments or prototypes which are spaced apart in time and reiteration/interpolation techniques to synthesise the signal between two prototypes. Such a system was described as early as 1971 (J.S. Severwight, "Interpolation Reiterations Techniques for Efficient Speech Transmission", Ph.D. Thesis, Loughborough University, Department of Electrical Engineering, 1971). More sophisticated systems of the same general class have been described more recently, for example in the paper by W.B. Kleijn, "Continuous Representations in Linear Predictive Coding", Proc. ICASSP-91, pp. 201-204, May 1991. The same author has published a series of related papers. The system employs 20msecs coding frames which are classified as voiced or unvoiced. Unvoiced frames are effectively CELP coded. Pitch prototype segments are defined in adjacent voiced frames, in the LPC residual signal, in a way which ensures maximum alignment (correlation) of the prototypes and defines the prototype so that the main pitch excitation pulse is not near to either of the ends of the prototype. A pitch period in a given frame is considered to be a cycle of an artificial periodic signal from which the prototype for the frame is obtained. The prototypes which have been appropriately
selected from adjacent frames are Fourier transformed and the resulting coefficients are coded using a differential vector quantization scheme.
With this scheme, during synthesis of voiced frames, the decoded prototype Fourier representations for adjacent frames are used to reconstruct the missing signal waveform between the two prototype segments using linear interpolation. Thus the residual signal is obtained which is then presented to an LPC synthesis filter the output of which provides the synthesised voiced speech signal. An amount of randomness can be introduced into voiced speech by injecting noise at frequencies larger than 2kHz, the amplitude of the noise increasing with frequency. In addition, the periodicity of synthesised voiced speech is controlled during the quantization of prototype parameters in accordance with a long term signal to change ratio measure that reflects the similarity which exists between the prototypes of adjacent frames in the residual excitation signal.
The known prototype interpolation coding systems rely upon a Fourier Series synthesis equation which involves a linear-with-time interpolation process. The assumption is that the pitch estimates for successive frames are linearly interpolated to provide a pitch function and an associated instant fundamental frequency. The instant phase used in the cosine and sine terms of the Fourier series synthesis equation is the integral of the instantaneous harmonic frequencies. This synthesis arrangement allows for the linear evolution of the
instantaneous pitch and the non-linear evolution of the instantaneous harmonic frequencies.
A development of this system is described by W.B. Kleijn and J. Haagen, "A Speech Coder Based on Decomposition of Characteristic Waveforms", Proc. ICASSP-95, pp. 508-511, Detroit, May 1995. In the described system the Fourier series coefficients are low pass filtered over time, with a cut-off frequency of 20Hz, to provide a "slowly evolving" waveform component for the LPC excitation signal. The difference between this low pass component and the original parameters provides the "rapidly evolving" components of the excitation signal. Periodic voiced excitation signals are mainly represented by the "slowly evolving" component, whereas random unvoiced excitation signals are represented by the "rapidly evolving" component in this dual decomposition of the Fourier series coefficients. This effectively removes the need for treating voiced and unvoiced frames separately. Furthermore, the rate of quantization and transmission of the two components is different. The "slowly evolving" signal is sampled at relatively long intervals of 25msecs, but the parameters are quantized quite accurately on the basis of spectral magnitude information. In contrast, the spectral magnitude of the "rapidly evolving" signal is sampled frequently, every 4msecs, but is quantized less accurately. Phase information is randomised every 2msecs.
Other developments of the prototype interpolation coding system have been proposed. For example one known system operates on 5msec frames, a
pitch period being selected for voiced frames and DFT transformed to yield prototype spectral magnitude values. These values are quantized and the quantized values for adjacent frames are linearly interpolated. Phase information is defined in a manner which does not satisfy any frequency restrictions at the interpolation boundaries. This causes problems of discontinuity at frame boundaries. At the receiver the excitation signal is synthesised using decoded magnitude and estimated phase values, via an inverse DFT process. The resulting signal is filtered by a following LPC synthesis filter. This model is purely periodic during voiced speech, and this is why a very short duration frame is used. Unvoiced speech is CELP coded.
The wide range of speech synthesis models currently being proposed, only some of which are described above, and the range of alternative approaches proposed to implement those models, indicates the interest in such systems and the lack of any consensus as to which system provides the most advantageous performance.
In known systems in which it is necessary to obtain an estimate of the pitch of a frame of a speech signal, it has been thought necessary, if high quality of synthesised speech is to be achieved, to obtain high resolution non-integer pitch period estimates. This requires complex processes, and it would be highly
desirable to reduce the complexity of the pitch estimation process in a manner which did not result in degraded quality.
According to a first aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including a voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered speech segment centred about a reference sample is defined in each frame, a correlation value is calculated for each of a series of candidate pitch estimates as the maximum of multiple crosscorrelation values obtained from variable length speech segments centred about the reference sample, the correlation values are used to form a correlation function defining peaks, and the locations of the peaks are determined and used to define a pitch estimate.
The result of the above system is that an integer pitch period value is obtained. The system avoids undue complexity and may be readily implemented.
Preferably the pitch estimate is defined using an iterative process. A single reference sample may be used, for example centred with respect to the respective frame, or alternatively multiple pitch estimates may be derived for each frame using different reference samples, the multiple pitch estimates being combined to define a combined pitch estimate for the frame. The pitch estimate may be modified by reference to a voiced/unvoiced status and/or pitch estimates of adjacent frames to define a final pitch estimate.
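As a rough illustration of the correlation measure just described, the sketch below crosscorrelates, for each candidate pitch period, segments of several lengths centred on the reference sample with the segments one period earlier, and keeps the maximum. It is a minimal sketch only: the fractional segment lengths and the normalised correlation are assumptions, not the patent's exact choices.

```python
import numpy as np

def correlation_function(speech, ref, candidates, lengths=(0.75, 1.0, 1.25)):
    """Normalised crosscorrelation of variable-length segments centred on a
    reference sample, producing one correlation value per candidate pitch
    period P.  The segment lengths (fractions of P) are assumed values."""
    corr = np.zeros(len(candidates))
    for i, P in enumerate(candidates):
        best = 0.0
        for frac in lengths:
            half = int(frac * P) // 2
            lo, hi = ref - half, ref + half + 1
            if lo - P < 0 or hi > len(speech):
                continue                      # segment falls outside the signal
            a = speech[lo:hi]                 # segment centred on the reference
            b = speech[lo - P:hi - P]         # segment one candidate period earlier
            denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
            if denom > 0.0:
                best = max(best, float(np.dot(a, b)) / denom)
        corr[i] = best
    return corr
```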

The correlation function may be clipped using a threshold value, remaining peaks being rejected if they are adjacent to larger peaks. Peaks are initially selected and can be rejected if they are smaller than a following peak by more than a predetermined factor, for example smaller than 0.9 times the following peak.
Preferably the pitch estimation procedure is based on a least squares error algorithm. Preferably the algorithm defines the pitch as a number whose multiples best fit the correlation function peak locations. Initial possible pitch values may be limited to integral numbers which are not consecutive, the increment between two successive numbers being proportional to a constant multiplied by the lower of those two numbers.
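A sketch of this candidate grid and least squares fit follows. The increment constant c = 0.02 and the 20-147 sample search range (typical for 8kHz speech) are assumed values, not taken from the patent.

```python
def candidate_grid(pmin=20, pmax=147, c=0.02):
    """Non-consecutive integer pitch candidates: the step between successive
    candidates is proportional to the lower candidate (c is an assumed constant)."""
    grid, p = [], pmin
    while p <= pmax:
        grid.append(p)
        p += max(1, int(c * p))
    return grid

def least_squares_pitch(peak_locations, candidates):
    """Choose the candidate whose integer multiples best fit the correlation
    function peak locations, in the least squares error sense."""
    best_p, best_err = None, float("inf")
    for P in candidates:
        err = 0.0
        for loc in peak_locations:
            k = max(1, round(loc / P))       # nearest multiple of the candidate
            err += (loc - k * P) ** 2
        err /= len(peak_locations)
        if err < best_err:
            best_p, best_err = P, err
    return best_p
```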
It is well known from the prior art to classify individual frames as voiced or unvoiced and to process those frames in accordance with that classification. Unfortunately such a simple classification process does not accurately reflect the true characteristics of speech. It is often the case that individual frames are made up of both periodic (voiced) and aperiodic (unvoiced) components. Prior attempts to address this problem have not proved particularly effective.
It is an object of the present invention to provide an improved voiced or unvoiced classification system.
According to a second aspect of the present invention there is provided a speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including pitch segment
magnitude spectral information, a voiced/unvoiced classification, and a mixed voiced classification which classifies harmonics in the magnitude spectrum of voiced frames as strongly voiced or weakly voiced, wherein a series of samples centred on the middle of the frame are windowed to form a data array which is Fourier transformed to produce a magnitude spectrum, a threshold value is calculated and used to clip the magnitude spectrum, the clipped data is searched to define peaks, the locations of peaks are determined, constraints are applied to define dominant peaks, and harmonics not associated with a dominant peak are classified as weakly voiced.
Peaks may be located using a second order polynomial. The samples may be Hamming windowed. The threshold value may be calculated by identifying the maximum and minimum magnitude spectrum values and defining the threshold as a constant multiplied by the difference between the maximum and minimum values. Peaks may be defined as those values which are greater than the two adjacent values. A peak may be rejected from consideration if neighbouring peaks are of a similar magnitude, e.g. more than 80% of the magnitude, or if there are spectral magnitudes in the same range of greater magnitude. A harmonic may be considered as not being associated with a dominant peak if the difference between two adjacent peaks is greater than a predetermined threshold value.
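The clipping and peak selection steps can be sketched as follows; the clipping constant and the 80% neighbour-similarity test mirror the text above, while the remaining details are assumptions.

```python
import numpy as np

def dominant_peaks(mag, clip_const=0.25, similar=0.8):
    """Clip the magnitude spectrum with a threshold derived from its range,
    pick local maxima, refine each location with a second order polynomial,
    and reject peaks whose neighbours are of similar (>80%) magnitude."""
    thresh = clip_const * (mag.max() - mag.min())   # constant * (max - min)
    clipped = np.where(mag > thresh, mag, 0.0)
    peaks = []
    for i in range(1, len(clipped) - 1):
        if clipped[i] > clipped[i - 1] and clipped[i] > clipped[i + 1]:
            # second order polynomial refinement of the peak location
            den = clipped[i - 1] - 2.0 * clipped[i] + clipped[i + 1]
            offset = 0.5 * (clipped[i - 1] - clipped[i + 1]) / den if den else 0.0
            peaks.append((i + offset, float(clipped[i])))
    kept = []
    for j, (loc, amp) in enumerate(peaks):
        nbs = [peaks[k][1] for k in (j - 1, j + 1) if 0 <= k < len(peaks)]
        if all(nb <= similar * amp for nb in nbs):
            kept.append((loc, amp))
    return kept
```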
The spectrum may be divided into bands of fixed width and a strongly/weakly voiced classification assigned for each band. Alternatively the
frequency range may be divided into two or more bands of variable width, adjacent bands being separated at a frequency selected by reference to the strongly/weakly voiced classification of harmonics.
Thus the spectrum may be divided into fixed bands, for example fixed bands each of 500Hz, or variable width bands selected in dependence upon the strongly/weakly voiced status of harmonic components of the excitation signal. A strongly/weakly voiced classification is then assigned to each band. The lowest frequency band, e.g. 0-500Hz, may always be regarded as strongly voiced, whereas the highest frequency band, for example 3500Hz to 4000Hz, may always be regarded as weakly voiced. In the event that a current frame is voiced, and the previous frame is unvoiced, other bands within the current frame, e.g. 3000Hz to 3500Hz, may be automatically classified as weakly voiced. Generally the strongly/weakly voiced classification may be determined using a majority decision rule on the strongly/weakly voiced classification of those harmonics which fall within the band in question. If there is no majority, alternate bands may be alternately assigned strongly voiced and weakly voiced classifications.
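A sketch of the majority decision rule for fixed 500Hz bands, assuming an 8kHz sampling rate and a per-harmonic strongly/weakly voiced flag already produced by the peak analysis; the tie-break alternation and the forced classification of the lowest and highest bands follow the text.

```python
def classify_bands(harmonic_freqs, strongly_voiced, band_width=500.0, fs=8000.0):
    """Assign a strongly (True) / weakly (False) voiced flag to each fixed-width
    band by majority vote over the harmonics falling inside the band."""
    n_bands = int((fs / 2) / band_width)
    flags = []
    for b in range(n_bands):
        lo, hi = b * band_width, (b + 1) * band_width
        votes = [sv for f, sv in zip(harmonic_freqs, strongly_voiced) if lo <= f < hi]
        strong, weak = votes.count(True), votes.count(False)
        if strong != weak:
            flags.append(strong > weak)
        else:
            # no majority: alternate the assignment between adjacent bands
            flags.append((not flags[-1]) if flags else True)
    flags[0] = True      # lowest band always regarded as strongly voiced
    flags[-1] = False    # highest band always regarded as weakly voiced
    return flags
```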
Given the classification of a voiced frame such that harmonics are classified as either strongly or weakly voiced, it is necessary to generate an excitation signal to recover the speech signal which takes into account this classification. It is an object of the present invention to provide such a system.
According to a third aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of
frames, each frame is defined as voiced or unvoiced, each frame is converted into a coded signal including a pitch period value, a frame voiced/unvoiced classification and, for each voiced frame, a mixed voiced spectral band classification which classifies harmonics within spectral bands as either strongly or weakly voiced, and the speech signal is reconstructed by generating an excitation signal in respect of each frame and applying the excitation signal to a filter, wherein for each weakly voiced spectral band, an excitation signal is generated which includes a random component in the form of a function which is dependent upon the respective pitch period value.
Thus for each frame which has a spectral band that is classified as weakly voiced, the excitation signal is represented by a function which includes a first harmonic frequency component, the frequency of which is dependent upon the pitch period value appropriate to that frame, and a second random component which is superimposed upon the first component.
The random component may be introduced by reducing the amplitude of harmonic oscillators assigned the weakly voiced classification, for example by reducing the power of the harmonics by 50%, while disturbing the oscillator frequencies, for example by shifting the oscillators randomly in frequency in the range of 0 to 30 Hz such that the frequency is no longer a multiple of the fundamental frequency, and then adding further random signals. The phase of the oscillators producing random signals may be randomised at pitch intervals.

Thus for a weakly voiced band, some periodicity remains but the power of the periodic component is reduced and then combined with a random component.
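The following sketch generates one frame of excitation for a weakly voiced band in the spirit of this passage: harmonic power halved, oscillator frequencies jittered by up to 30 Hz, phases randomised, and a further random signal added. The 1/sqrt(2) amplitude scaling (half power) and the noise level are assumptions.

```python
import numpy as np

def weakly_voiced_band(amps, f0, band, n, fs=8000.0, rng=np.random.default_rng()):
    """Excitation for one weakly voiced band: attenuated, frequency-jittered
    harmonic oscillators plus added noise.  amps[k] is the amplitude of the
    (k+1)th harmonic of fundamental f0 (Hz); band = (lo, hi) in Hz."""
    t = np.arange(n) / fs
    x = np.zeros(n)
    for k, a in enumerate(amps, start=1):
        f = k * f0
        if not (band[0] <= f < band[1]):
            continue
        f += rng.uniform(-30.0, 30.0)            # no longer a multiple of f0
        phase = rng.uniform(0.0, 2.0 * np.pi)    # randomised phase
        x += (a / np.sqrt(2.0)) * np.cos(2 * np.pi * f * t + phase)  # half power
    x += 0.1 * np.std(x) * rng.standard_normal(n)  # further random signal (assumed level)
    return x
```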
In a speech synthesis system in which a speech signal is represented in part by spectral information in the form of harmonic magnitude values, it is possible to process an input speech signal to produce a series of spectral magnitude values and then to use all of those magnitude values at harmonic locations in subsequent processing steps. In many circumstances however at least some of the magnitude values contain little information which is useful in the recovery of the input speech signal. Accordingly when magnitude values are quantized for transmission to a receiver it is sensible to discard magnitude values which contain little useful information.
In one known system an input speech signal is processed to produce an LPC residual signal which in turn is processed to provide harmonic magnitude values, but only a fixed number of those magnitude values is vector quantized for transmission to a receiver. The discarded magnitude values are represented at the receiver as identical constant values. This known system reduces redundancy but is inflexible in that the locations of the fixed number of magnitude values to be quantized are always the same and predetermined on the basis of assumptions that may be inappropriate in particular circumstances.
It is an object of the present invention to provide an improved magnitude value quantization system.

According to a fourth aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, and each voiced frame is converted into a coded signal including a pitch period value, LPC coefficients, and pitch segment spectral magnitude information, wherein the spectral magnitude information is quantized by sampling the LPC short term magnitude spectrum at harmonic frequencies, the locations of the largest spectral samples are determined to identify which of the magnitudes are relatively more important for accurate quantization, and the magnitudes so identified are selected and vector quantized.
Thus rather than relying upon a simple location selection strategy of a fixed number of magnitude values for quantization and transmission, for example the "low part" of the magnitude spectrum, the invention selects only those values which make a significant contribution according to the subjectively important LPC magnitude spectrum, thereby reducing redundancy without compromising quality.
In one arrangement in accordance with the invention a pitch segment of Pn LPC residual samples is obtained, where Pn is the pitch period value of the nth frame, the pitch segment is DFT transformed, the mean value of the resultant spectral magnitudes is calculated, the mean value is quantized and used as a normalisation factor for the selected magnitudes, and the resulting normalised amplitudes are quantized.

Alternatively, the RMS value of the pitch segment is calculated, the RMS value is quantized and used as a normalisation factor for the selected magnitudes, and the resulting normalised amplitudes are quantized.
At the receiver, the selected magnitudes are recovered, and each of the other magnitude values is reproduced as a constant value.
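A minimal sketch of the selection and normalisation steps, assuming lpc_coeffs = [1, a1, ..., ap] and leaving the actual quantizers out: the pitch segment is DFT transformed, the LPC magnitude spectrum is sampled at the harmonics to rank importance, and the selected magnitudes are normalised by their mean.

```python
import numpy as np

def select_magnitudes(residual_segment, lpc_coeffs, n_keep=8):
    """Return the indices of the harmonics ranked most important by the LPC
    short term magnitude spectrum, their mean-normalised residual magnitudes,
    and the mean (quantized in the real system)."""
    P = len(residual_segment)                     # pitch period Pn
    mags = np.abs(np.fft.rfft(residual_segment))  # harmonic magnitudes of the segment
    # sample the LPC magnitude spectrum 1/|A(e^jw)| at the harmonic frequencies
    w = 2 * np.pi * np.arange(len(mags)) / P
    A = np.array([abs(np.polyval(list(lpc_coeffs)[::-1], np.exp(-1j * wk))) for wk in w])
    lpc_mag = 1.0 / np.maximum(A, 1e-9)
    keep = np.sort(np.argsort(lpc_mag)[-n_keep:])  # locations of largest LPC samples
    mean = mags.mean()
    return keep, mags[keep] / mean, mean
```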
Interpolation coding systems which employ a pitch-related synthesis formula to recover speech generally encounter the problem of coding a variable length, pitch dependent spectral amplitude vector. The quantization scheme referred to above in which only the magnitudes of relatively greater importance are quantized avoids this problem by quantizing only a fixed number of magnitude values and setting the rest of the magnitude values to a constant value. Thus at the receiver a fixed length vector can be recovered. Such a solution to the problem however may result in a relatively spectrally flat excitation model which has limitations in providing high recovered speech quality.
In an ideal world output speech quality would be maximised by quantizing the entire shape of the magnitude spectrum, and various approaches have been proposed for coding the entire magnitude spectrum. In one approach, the spectrum is DFT transformed and coded differentially across successive spectra. This and similar coding schemes are rather inefficient however and operate with relatively high bit rates. The introduction of vector quantization
allowed for the development of sinusoidal and prototype interpolation systems which operate at lower bit rates, typically around 2.4Kbits/sec.
Two vector quantization methodologies have been reported which quantize a variable size input vector with a fixed size code vector. In a first approach, the input vector is transformed to a fixed size vector which is then conventionally vector quantized. An inverse transform of the quantized fixed size vector yields the recovered quantized vector. Transformation techniques which have been used include linear interpolation, band limited interpolation, all pole modelling and non-square transformation. This approach however produces an overall distortion which is the summation of the vector quantization noise and a component which is introduced by the transformation process. In a second known approach, a variable input vector is directly quantized with a fixed size code vector. This approach is based on selecting only a limited number of elements from each codebook vector to form a distortion measure between a codebook vector and an input vector. Such a quantization approach avoids the transformation distortion of the alternative technique mentioned above and results in an overall distortion that is equal to the vector quantization noise, but this is significant.
It is an object of the present invention to provide an improved variable sized spectral vector quantization scheme.
According to a fifth aspect of the present invention, there is provided a speech synthesis system in which a variable size input vector of coefficients to be
transmitted to a receiver for the reconstruction of a speech signal is vector quantized using a codebook defined by vectors of fixed size, the codebook vectors of fixed size are obtained from variable size training vectors and an interpolation technique which is an integral part of the codebook generation process, codebook vectors are compared to the variable sized input vector using the interpolation process, and an index associated with the codebook entry with the smallest difference from the comparison is transmitted, the index being used to address a further codebook at the receiver and thereby derive an associated fixed size codebook vector, and the interpolation process being used to recover from the derived fixed sized codebook vector an approximation of the variable sized input vector.
The invention is applicable in particular to pitch synchronous low bit rate coders of the type described in this document and takes advantage of the underlying principle of such coders which means that the shape of the magnitude spectrum is represented by a relatively small number of equally spaced samples.
Preferably the interpolation process is linear. For an input vector of given dimension, the interpolation process is applied to produce from the codebook vectors a set of vectors of that given dimension. A distortion measure is then derived to compare the interpolated set of vectors and the input vector and the codebook vector which yields the minimum distortion is selected.
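The encoder search and decoder reconstruction of this variable-size vector quantizer can be sketched as below, assuming simple linear resampling and a Euclidean distortion measure.

```python
import numpy as np

def resize(vec, dim):
    """Linearly interpolate a fixed-size vector to an arbitrary dimension."""
    src = np.linspace(0.0, 1.0, len(vec))
    dst = np.linspace(0.0, 1.0, dim)
    return np.interp(dst, src, vec)

def vq_search(input_vec, codebook):
    """Find the codebook index whose interpolated vector is closest to the
    variable sized input vector (Euclidean distortion, an assumption)."""
    dists = [np.sum((resize(cv, len(input_vec)) - input_vec) ** 2) for cv in codebook]
    return int(np.argmin(dists))

def vq_decode(index, codebook, dim):
    """Receiver side: interpolate the indexed fixed-size codebook vector back
    to the transmitted dimension to approximate the input vector."""
    return resize(codebook[index], dim)
```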
Preferably the dimension of the input vectors is reduced by taking into account only the harmonic amplitudes within the input bandwidth range, for
example 0 to 3.4kHz. Preferably the remaining amplitudes, i.e. in the region of 3.4kHz to 4kHz, are set to a constant value. Preferably, the constant value is equal to the mean value of the quantized amplitudes.
Amplitude vectors obtained from adjacent residual frames exhibit significant amounts of redundancy which can be removed by means of backward prediction. The backward prediction may be performed on a harmonic basis such that the amplitude value of each harmonic of one frame is predicted from the amplitude value of the same harmonic in the previous frame or frames. A fixed linear predictor may be incorporated in the system, together with mean removal and gain shape quantization processes which operate on a resulting error magnitude vector.
Although the above described variable sized vector quantization scheme provides advantageous characteristics, and in particular provides for good perceived signal quality at a bit rate of for example 2.4Kbits/sec, in some environments a lower bit rate would be highly desirable even at the loss of some quality. It would be possible for example to rely upon a single value representation and quantization strategy on the assumption that the magnitude spectrum of the pitch segment in the residual domain has an approximately flat shape. Unfortunately systems based on this assumption have a rather poor decoded speech quality.
It is an object of the present invention to overcome the above limitation in lower bit rate systems.

According to a sixth aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, each frame is converted into a coded signal including an estimated pitch period, an estimate of the energy of a speech segment the duration of which is a function of the estimated pitch period, and LPC filter coefficients defining an LPC spectral envelope, and a speech signal of related power to the power of the input speech signal is reconstructed by generating an excitation signal using spectral amplitudes which are defined from a modified LPC spectral envelope sampled at the harmonic frequencies defined by the pitch period.
Thus, although a single value is used to represent the spectral envelope of the excitation signal, the excitation spectral envelope is shaped according to the LPC spectral envelope. The result is a system which is capable of delivering high quality speech at 1.2Kbits/sec. The invention is based on the observation that some of the speech spectrum resonance and anti-resonance information is also present in the residual magnitude spectrum, since LPC inverse filtering cannot produce a residual signal of absolutely flat magnitude spectrum. As a consequence, the LPC residual signal is itself highly intelligible.
The magnitude values may be obtained by spectrally sampling a modified LPC synthesis filter characteristic at the harmonic locations related to the pitch period. The modified LPC synthesis filter may have reduced feed back gain and a frequency response which consists of equalised resonant peaks, the locations of which are close to the LPC synthesis resonant locations. The value of the feed
back gain may be controlled by the performance of the LPC model such that it is for example proportional to the normalised LPC prediction error. The energy of the reproduced speech signal may be equal to the energy of the original speech waveform.
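One common way to realise a synthesis filter with reduced feedback gain is bandwidth expansion, replacing A(z) by A(z/g) with g < 1; the sketch below samples the resulting magnitude response at the harmonics of the pitch period. The use of bandwidth expansion here, and tying g to the normalised LPC prediction error, are illustrative assumptions rather than the patent's exact formulation.

```python
import numpy as np

def modified_lpc_amplitudes(lpc_coeffs, pitch_period, gain):
    """Sample the magnitude response of a modified LPC synthesis filter
    1/A(z/gain) at the harmonic frequencies 2*pi*k/pitch_period.
    gain < 1 reduces the feedback gain and equalises the resonant peaks;
    it could, e.g., be derived from the normalised LPC prediction error."""
    a = np.asarray(lpc_coeffs, dtype=float)       # [1, a1, ..., ap]
    a_mod = a * gain ** np.arange(len(a))         # coefficients of A(z/gain)
    n_harm = pitch_period // 2
    amps = np.empty(n_harm)
    for k in range(1, n_harm + 1):
        w = 2 * np.pi * k / pitch_period
        z = np.exp(-1j * w * np.arange(len(a_mod)))
        amps[k - 1] = 1.0 / max(abs(np.dot(a_mod, z)), 1e-9)
    return amps
```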
It is well known that in prototype interpolation coding speech synthesis systems there are often substantial similarities between the prototypes of adjacent frames in the residual excitation signals. This has been used in various systems to improve perceived speech quality by ensuring that there is a smooth evolution of the speech signal over time.
It is an object of the present invention to provide an improved speech synthesis system in which the excitation and vocal tract dynamics are substantially preserved in the recovered speech signal.
According to a seventh aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, each frame is converted into a coded signal including LPC filter coefficients and at least one parameter associated with a pitch segment magnitude, and the speech signal is reconstructed by generating two excitation signals in respect of each frame, each pair of excitation signals comprising a first excitation signal generated on the basis of the pitch segment magnitude parameter or parameters of one frame and a second excitation signal generated on the basis of the pitch segment magnitude parameter or parameters of a second frame which follows and is adjacent to the said one frame, applying the
first excitation signal to a first LPC filter the characteristics of which are determined by the LPC filter coefficients of the said one frame and applying the second excitation signal to a second LPC filter the characteristics of which are determined by the LPC filter coefficients of the said second frame, and weighting and combining the outputs of the first and second LPC filters to produce one frame of a synthesised speech signal.
Preferably the first and second excitation signals include the same phase function and different phase contributions from the two LPC filters involved in the above double synthesis process. This reduces the degree of pitch periodicity in the recovered signal. This and the combination of the first and second LPC filter outputs ensures an effective smooth evolution of the speech spectral envelope on a sample by sample basis.
Preferably the outputs of the first and second LPC filters are weighted by half a window function such as a Hamming window such that the magnitude of the output of the first filter is decreasing with time and the magnitude of the output of the second filter is increasing with time.
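The weighting and combination step can be sketched as a cross-fade with the two halves of a Hamming window, assuming both filter outputs span the same frame of n samples; scipy's lfilter stands in for the LPC synthesis filters.

```python
import numpy as np
from scipy.signal import lfilter

def double_synthesis(exc1, exc2, lpc1, lpc2):
    """Filter two excitation signals through the LPC filters of the current and
    following frames, then cross-fade with the two halves of a Hamming window
    (first output fading out, second fading in)."""
    n = len(exc1)
    y1 = lfilter([1.0], lpc1, exc1)          # 1/A1(z), lpc1 = [1, a1, ...]
    y2 = lfilter([1.0], lpc2, exc2)          # 1/A2(z)
    win = np.hamming(2 * n)
    w_down, w_up = win[n:], win[:n]          # decreasing / increasing halves
    return w_down * y1 + w_up * y2
```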
According to an eighth aspect of the present invention, there is provided a speech coding system which operates on a frame by frame basis, and in which information is transmitted which represents each frame as either voiced or unvoiced and, for each voiced frame, represents that frame by a pitch period value, quantized magnitude spectral information, and LPC filter coefficients, the received pitch period value and magnitude spectral information being used to
generate residual signals at the receiver which are applied to LPC speech synthesis filters the characteristics of which are determined by the transmitted filter coefficients, wherein each residual signal is synthesised according to a sinusoidal mixed excitation synthesis process, and a recovered speech signal is derived from the residual signals.

Embodiments of the present invention will now be described, by way of
example, with reference to the accompanying drawings, in which:
Figure 1 is a general block diagram of the encoding process in accordance
with the present invention;
Figure 2 illustrates the relationship between coding and matrix
quantisation frames;
Figure 3 is a general block diagram of the decoding process;
Figure 4 is a block diagram of the excitation synthesis process;
Figure 5 is a schematic diagram of the overlap and add process;
Figure 6 is a schematic diagram of the calculation of an instantaneous
scaling factor;
Figure 7 is a block diagram of the overall voiced/unvoiced classification
and pitch estimation process;
Figure 8 is a block diagram of the pitch estimation process;
Figure 9 is a schematic diagram of two speech segments which participate
in the calculation of a crosscorrelation function value;
Figure 10 is a schematic diagram of speech segments used in the
calculation of the crosscorrelation function value;
Figure 11 represents the value allocated to a parameter used in the
calculation of the crosscorrelation function value for different delays;
Figure 12 is a block diagram of the process used for calculating the
crosscorrelation function and the selection of its peaks;
Figure 13 is a flow chart of a pitch estimation algorithm;

Figure 14 is a flow chart of a procedure used in the pitch estimation
process;
Figure 15 is a flow chart of a further procedure used in the pitch
estimation process;
Figure 16 is a flow chart of a further procedure used in the pitch
estimation process;
Figure 17 is a flow chart of a threshold value selection procedure;
Figure 18 is a flow chart of the voiced/unvoiced classification process;
Figure 19 is a schematic diagram of the voiced/unvoiced classification
process with respect to parameters generated during the pitch estimation
process;
Figure 20 is a flow chart of the procedure used to determine offset values;
Figure 21 is a flow chart of the pitch estimation algorithm;
Figure 22 is a flow chart of a procedure used to impose constraints on
output pitch estimates to ensure smooth evolution of pitch values with time;
Figures 23, 24 and 25 represent different portions of a flow chart of a
pitch post processing procedure;
Figure 26 is a general block diagram of the LPC analysis and LPC
quantisation process;
Figure 27 is a general flow chart of a strongly or weakly voiced
classification process;

Figure 28 is a flow chart of the procedure responsible for the
strongly/weakly voiced classification;
Figure 29 represents a speech waveform obtained from a particular
speech utterance;
Figure 30 shows frequency tracks obtained for the speech utterance of
Figure 29;
Figure 31 shows to a larger scale a portion of Figure 30 and represents the
difference between strongly and weakly voiced classifications;
Figure 32 shows a magnitude spectrum of a particular speech segment
and the corresponding LPC spectral envelope and the normalised short term
magnitude spectra of the corresponding residual segment, excitation segment
obtained using a binary excitation model and an excitation segment obtained
using the strongly/weakly voiced model;
Figure 33 is a general block diagram of a system for representing and
quantising magnitude information;
Figure 34 is a block diagram of an adaptive quantiser shown in Figure 33;
Figure 35 is a general block diagram of a quantisation process;
Figure 36 is a general block diagram of a differential variable size
spectral vector quantiser; and
Figure 37 represents the hierarchical structure of a mean gain shape
quantiser.

A system in accordance with the present invention is described below, firstly in general terms
and then in greater detail. The system operates on an LPC residual signal on a frame by
frame basis.
Speech is synthesised using the following general expression:

    s(i) = \sum_{k=0}^{K} A_k(i) \cos(\omega_k(i) + \psi_k)    (1)

where i is the sampling instant and A_k(i) represents the amplitude value of the kth cosine term
\cos(\theta_k(i)) (with \theta_k(i) = \omega_k(i) + \psi_k) as a function of i. In voiced speech K depends on the
pitch frequency of the signal.
A voiced/unvoiced classification process allows the coding of voiced and unvoiced frames to
be handled in different ways. Unvoiced frames are modelled in terms of an RMS value and a
random time series. In voiced frames a pitch period estimate is obtained and used to define a
pitch segment which is centred at the middle of the frame. Pitch segments from adjacent
frames are DFT transformed and only the resulting pitch segment magnitude information is
coded and transmitted. Furthermore, pitch segment magnitude samples are classified as
strongly or weakly voiced. Thus, the information which is transmitted for every voiced
frame is, in addition to voiced/unvoiced information, the pitch period value, the magnitude
spectral information of the pitch segment, the strongly/weakly voiced classification of the
pitch magnitude spectral values, and the LPC filter coefficients.
At the receiver a synthesis process, that includes interpolation, is used to reconstruct the
waveform between the middle points of the current (n+1)th and previous nth frames. The
basic synthesis equation for the residual signal is:

    Res(i) = \sum_{j=0}^{K} MG_j \cos(phase_j(i))    (2)

where MG_j are the decoded pitch segment magnitude values and phase_j(i) is calculated from the
integral of the linearly interpolated instantaneous harmonic frequencies \omega_j(i). K is the largest
value of j for which \omega_j^n(i) \le \pi.
In the transitions from unvoiced to voiced, the initial phase for each harmonic is set to zero.
Phase continuity is preserved across the boundaries of successive interpolation intervals.
The synthesis process is performed twice however, once using the magnitude spectral values
MG_j^{n+1} of the pitch segment derived from the current (n+1)th frame and again using the
magnitude values MG_j^n of the pitch segment derived in the previous nth frame. The phase
function phase_j(i) in each case remains the same. The resulting residual signals Res_n(i) and
Res_{n+1}(i) are used as inputs to corresponding LPC synthesis filters calculated for the nth and
(n+1)th speech frames. The two LPC synthesised speech waveforms are then weighted by
W_{n+1}(i) and W_n(i) to yield the recovered speech signal.
Thus the overall synthesis process, for successive voiced frames, can be described by:

    S(i) = W_n(i) \sum_{j=0}^{K} H^n(\omega_j(i)) MG_j^n \cos[phase_j(i) + \varphi^n(\omega_j(i))]
         + W_{n+1}(i) \sum_{j=0}^{K} H^{n+1}(\omega_j(i)) MG_j^{n+1} \cos[phase_j(i) + \varphi^{n+1}(\omega_j(i))]    (3)

where H^n(\omega_j(i)) is the frequency response of the nth frame LPC synthesis filter, calculated
at the \omega_j^n(i) harmonic frequency function at the ith instant, and \varphi^n(\omega_j(i)) is the associated
phase response of this filter. \omega_j^n(i) and phase_j^n(i) are the frequency and phase functions
defined for the sampling instants i, with i covering the middle of the nth frame to the middle
of the (n+1)th frame segment. K is the largest value of j for which \omega_j^n(i) \le \pi.

The above speech synthesis process introduces two "phase dispersion" terms, i.e. \varphi^n(\omega_j(i))
and \varphi^{n+1}(\omega_j(i)), which effectively reduce the degree of pitch periodicity in the recovered
signal. In addition, this "double synthesis" arrangement followed by an overlap-add process
ensures an effective smooth evolution of the speech spectral envelope (LPC) on a sample by
sample basis.
The LPC excitation signal is based on a "mixed" excitation model which allows for the
appropriate mixing of periodic and random excitation components in voiced frames on a
frequency-band basis. This is achieved by operating the system such that the magnitude
spectrum of the residual signal is examined, and applying a peak-picking process, near the
resonant frequencies, to detect possible dominant spectral peaks. A peak associated with a
frequency \omega_j indicates a high degree of voicing (represented by hv_j=1) for that harmonic. The
absence of an adjacent spectral peak, on the other hand, indicates a certain degree of
randomness (represented by hv_j=0). When hv_j=1 (to indicate "strong" voicing) the
contribution of the jth harmonic to the synthesis process is MG_j \cos(phase_j(i)). However,
when hv_j=0 (to indicate "weak" voicing) the frequency of the jth harmonic is slightly
dithered, its magnitude MG_j is reduced to MG_j / \sqrt{2} and random cosine terms are added
symmetrically alongside the jth harmonic \omega_j. The terms "strong" and "weak" are used in this
sense below. The number NRS of these random terms is

    NRS = 2 \times \lceil \omega_0 / (4\pi \times (50/f_s)) \rceil    (4)

where \lceil \cdot \rceil indicates rounding off to the next larger integer value. Furthermore, the NRS
random components are spaced at 50 Hz intervals symmetrically about \omega_j, \omega_j being located in
the middle of such a 50 Hz interval. The amplitudes of the NRS random components are set
to MG_j / \sqrt{2 \times NRS}. Their initial phases are selected randomly from the [-\pi, +\pi] region at
pitch period intervals.
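As a minimal sketch of this weak-voicing randomisation (assuming a sampling frequency of 8 kHz and treating the function and variable names as illustrative), the NRS components for one weakly voiced harmonic could be generated as follows:

    import numpy as np

    def weak_harmonic_components(w_j, w_0, mg_j, fs=8000.0):
        # w_j : harmonic frequency (rad/sample); w_0 : fundamental (rad/sample)
        # mg_j: decoded magnitude MG_j of the harmonic
        step = 2 * np.pi * 50.0 / fs                             # 50 Hz spacing
        nrs = 2 * int(np.ceil(w_0 / (4 * np.pi * (50.0 / fs))))  # Equation (4)
        q = np.arange(nrs)
        # components placed mid-slot, symmetrically about w_j
        freqs = w_j + step / 2 + (q - nrs // 2) * step
        amps = np.full(nrs, mg_j / np.sqrt(2 * nrs))             # MG_j / sqrt(2 x NRS)
        phases = np.random.uniform(-np.pi, np.pi, nrs)           # re-drawn at pitch intervals
        # the harmonic itself would be kept at mg_j / sqrt(2)
        return freqs, amps, phases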

The hv_j information must be transmitted to be available at the receiver and, in order to reduce
the bit rate allocated to hv_j, the bandwidth of the input signal is divided into a number of
fixed size bands BD_k and a "strongly" or "weakly" voiced flag Bhv_k is assigned for each
band. In a "strongly" voiced band, a highly periodic signal is reproduced. In a "weakly"
voiced band, a signal which combines both periodic and aperiodic components is required.
These bands are classified as strongly voiced (Bhv_k=1) or weakly voiced (Bhv_k=0) using a
majority decision rule approach on the hv_j classification values of the harmonics \omega_j contained
within each frequency band.
Further restrictions can be imposed on the strongly/weakly voiced profiles resulting from the
classification of bands. For example, the first A bands may always be strongly voiced, i.e.
hv_j=1 for BD_k with k=1,2,...,A, A being a variable. The remaining spectral bands can be
strongly or weakly voiced.
Figure 1 schematically illustrates processes operated by the system encoder. These processes
are referred to in Figure 1 as Processes I to VII and these terms are used throughout this
document. Figure 2 represents the relationship between analysis/coding frame sizes
employed. These are M samples per coding frame, e.g. 160 samples per frame, and k frames
are analysed in a block, for example k=4. This block size is used for matrix quantisation. A
speech signal is input and Processes I, III, IV, VI and VII produce outputs for transmission.
Assuming that the first Matrix Quantisation analysis frame (MQA) of k x M samples is
available, each of the k coding frames within the MQA is classified as voiced or unvoiced
(V_n) using Process I. A pitch estimation part of Process I provides a pitch period value P_n
only when a coding frame is voiced.

Process II operates in parallel on the input speech samples and estimates p LPC filter
coefficients a (for example p=10) every L samples (L is a multiple of M, i.e. L = m x M, and m
may be equal to for example 2). In addition, k/m is an integer and represents the frame
dimension of the matrix quantiser employed in Process III. Thus the LPC filter coefficients
are quantised, using Process III, and transmitted. The quantised coefficients \hat{a} are used to
derive a residual signal R_n(i).
When an input coding frame is unvoiced, the energy E_n of the residual obtained for this
frame is calculated (Process VII). \sqrt{E_n} is then quantised and transmitted.
When the nth coding frame is classified as voiced, a segment of P_n residual samples is
obtained (P_n is the pitch period value associated with the nth frame). This segment is centred
in the middle of the frame. The selected P_n samples are DFT transformed (Process V) to yield
\lceil (P_n + 1)/2 \rceil spectral magnitude values MG_j^n, 0 \le j < \lceil (P_n + 1)/2 \rceil, and \lceil (P_n + 1)/2 \rceil phase
values. The phase information is neglected. The magnitude information is coded (using
Process VI) and transmitted. In addition a segment of 20 msecs, which is centred in the
middle of the nth coding frame, is obtained from the residual signal R_n(i). This is input to
Process IV, together with P_n, to provide the strongly/weakly voiced classification parameters
hv_j^n of the harmonics \omega_j^n. Process IV produces quantised Bhv information, which for voiced
frames is multiplexed and transmitted to the receiver together with the voiced/unvoiced
decision V_n, the pitch period P_n, the quantised LPC coefficients \hat{a} of the corresponding LPC
frame, and the magnitude values MG_j^n. In unvoiced frames only the quantised \sqrt{E_n} value
and the quantised LPC filter coefficients \hat{a} are transmitted.
Figure 3 schematically illustrates processes operated by the system decoder. In general terms,
given the received parameters of the nth coding frame and those of the previous (n-1)th
coding frame, the decoder synthesises a speech signal S_n(i) that extends from the middle of
the (n-1)th frame to the middle of the nth frame. This synthesis process involves the
generation in parallel of two excitation signals Res_n(i) and Res_{n-1}(i) which are used to drive
two independent LPC synthesis filters 1/A_n(z) and 1/A_{n-1}(z) the coefficients of which are
derived from the transmitted quantised coefficients \hat{a}. The outputs X_n(i) and X_{n-1}(i) of these
synthesis filters are weighted and added to provide a speech segment which is then post
filtered to yield the recovered speech S_n(i). The excitation synthesis process used in both
paths of Figure 3 is shown in more detail in Figure 4.
The process commences by considering the voiced/unvoiced status V_k, where k is equal to n
or n-1 (see Figure 4). When the frame is unvoiced, i.e. V_k=0, a Gaussian random number
generator RG(0,1) of zero mean and unit variance provides a time series which is
subsequently scaled by the \sqrt{E_k} value received for this frame. This is effectively the
required:

    Res_k(i) = \sqrt{E_k} \times RG(0,1)    (5)

signal which is then presented to the corresponding LPC synthesis filter 1/A_k(z), k=n or n-1.
Performance could be increased if the \sqrt{E} value was calculated, quantised and
transmitted every 5 msecs. Thus, provided that bits are available when coding unvoiced
speech, four \sqrt{E_{n,\lambda}}, \lambda=0,...,3, values are transmitted for every unvoiced frame of
20 msecs duration (160 samples).
In the case where V_k=1, the Res_k(i) excitation signal is defined as the summation of a
"harmonic" Res_k^h(i) component and a "random" Res_k^r(i) component. The top path of the
V_k=1 part of the synthesis in Figure 4, which provides the harmonic component of this mixed
excitation model, always calculates the instantaneous harmonic frequency function \omega_j^n(i)
which is associated with the interpolation interval that is defined between the middle points
of the nth and (n-1)th frames (i.e. this action is independent of the value of k). Thus, when

CA 022~i9374 1998 -12 - 29
PCT/GB97/0 18~ 1
WO 98/01~48
34
decoding the nth frame, \omega_j^n(i) is calculated using the pitch frequencies f_j^{1,n}, f_j^{2,n} and linear
interpolation, i.e.

    \omega_j^n(i) = 2\pi ((f_j^{1,n} - f_j^{2,n}) / M) i + 2\pi f_j^{2,n}    (6)

with 0 \le j < \lceil (P_{max} + 1)/2 \rceil, 0 \le i < M and P_{max} = max[P_n, P_{n-1}].
The frequencies f_j^{1,n} and f_j^{2,n} are defined as follows:
I) When both the nth and (n-1)th coding frames are voiced, i.e. V_n=1 and V_{n-1}=1, then the
pitch frequencies are estimated as follows:
a) If

    |P_n - P_{n-1}| < 0.2 \times (P_n + P_{n-1})    (7)

which means that the pitch values of the nth and (n-1)th coding frames are rather
similar, then:

    f_j^{1,n} = j / P_n + (1 - hv_j^n) \times RU(-a,+a)    (8)

    f_j^{2,n} = f_j^{1,n-1}   if (V_{n-1} = V_n AND |P_n - P_{n-1}| < 0.2 (P_n + P_{n-1}))
    f_j^{2,n} = j / P_{n-1}   otherwise    (9)

The f_j^{1,n-1} value is calculated during the decoding process of the previous (n-1)th
coding frame. hv_j^n is the strongly/weakly voiced classification (0 or 1) of the jth
harmonic \omega_j^n. P_n and P_{n-1} are the received pitch estimates from the n and n-1 frames.
RU(-a,+a) indicates the output of a random number generator with uniform pdf within
the -a to +a range (a=0.00375).
b) If

    |P_n - P_{n-1}| > 0.2 \times (P_n + P_{n-1})    (10)

then

    f_j^{1,n} = j (1/P_n - b) + (1 - hv_j^n) \times RU(-a,+a)    (11)

and

    f_j^{2,n} = f_j^{1,n-1} + b \times j

where b is defined as:

    b = (0.7 / (P_n + P_{n-1})) / 2 \times sgn(1/P_n - f_1^{1,n-1})    (12)

Notice that in case (b), which applies for significantly different P_n and P_{n-1} pitch estimates,
equations 11 and 12 ensure that the rate of change of the \omega_j(i) function is restricted to
(j \times 0.7 / (P_n + P_{n-1})) / M.
II) When one of the two coding frames (i.e. n, n-1) is unvoiced, one of the following two
definitions is applicable:
a) for V_{n-1}=0 and V_n=1:

    f_j^{2,n} = j / P_n,  0 \le j < \lceil (P_n + 1)/2 \rceil

and f_j^{1,n} is given by Equation (8).
b) for V_{n-1}=1 and V_n=0:
f_j^{2,n} is set to the f_j^{1,n-1} value, which has been calculated during the decoding process of
the previous (n-1)th coding frame, and f_j^{1,n} = f_j^{2,n}.
Given \omega_j^n(i), the instantaneous function phase_j^n(i) is calculated by:

    phase_j^n(i) = 2\pi ((f_j^{1,n} - f_j^{2,n}) / (2M)) i^2 + 2\pi f_j^{2,n} i + phase_j^{n-1}(M)    (13)

for 0 \le j < \lceil (P_{max} + 1)/2 \rceil and 0 \le i < M.
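A short sketch of Equations (6) and (13), the quadratic phase track that results from integrating the linearly interpolated harmonic frequency, is given below; the function and variable names are illustrative:

    import numpy as np

    def harmonic_phase_track(f1, f2, phase_prev_M, M):
        # f1 = f_j^{1,n}, f2 = f_j^{2,n} in cycles/sample;
        # phase_prev_M = phase_j^{n-1}(M), carried over for phase continuity.
        i = np.arange(M)
        return (2 * np.pi * (f1 - f2) / (2 * M)) * i**2 \
               + 2 * np.pi * f2 * i + phase_prev_M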
Furthermore, the "harmonic" component Res_k^h(i) of the residual signal is given by:

    Res_k^h(i) = \sum_{j=0}^{\lceil (P_k+1)/2 \rceil - 1} C_j(i) \times MG_j^k(hv_j^k) \times \cos[phase_j^k(i)],  0 \le i < M    (14)

where k=n or n-1,

    C_j(i) = 0   if \omega_j^k(i) > \pi
    C_j(i) = 1   if \omega_j^k(i) \le \pi

    MG_j^k(hv_j^k) = MG_j^k / \sqrt{2}   for hv_j^k = 0 and 1 \le j < \lceil (P_k+1)/2 \rceil
    MG_j^k(hv_j^k) = MG_j^k             for hv_j^k = 1 and 1 \le j < \lceil (P_k+1)/2 \rceil
    MG_j^k(hv_j^k) = 0                  otherwise, including j = 0

and MG_j^k, j=1,...,\lceil (P_k + 1)/2 \rceil - 1, are the received magnitude values of the kth coding
frame, with k=n or k=n-1.

The second path of the V_k=1 case in Figure 4 provides the random excitation component
Res_k^r(i). In particular, given the recovered strongly/weakly voiced classification values hv_j^k,
the system calculates, for those harmonics with hv_j = 0, the number NRS of random sinusoidal
components which are used to randomise the corresponding harmonic. This is:

    NRS = 2 \times \lceil \omega_0^k(i) / (4\pi \times (50/f_s)) \rceil    (15)

where f_s is the sampling frequency. Notice that the NRS random sinusoidal components are
located symmetrically about the corresponding harmonic \omega_j^k and they are spaced 50 Hz
apart.
The instantaneous frequency of the qth random component, q=0,1,...,NRS-1, for the jth
harmonic \omega_j^k is calculated by:

    \Omega_{j,q}^k(i) = \omega_j^k(i) + 2\pi \times (25/f_s) + (q - (NRS/2)) \times 2\pi \times (50/f_s)    (16)

for 0 \le j < \lceil (P_k + 1)/2 \rceil and 0 \le i < M.
The associated phase value is:

    Ph_{j,q}^k(i) = ((\Omega_{j,q}^k(M) - \Omega_{j,q}^k(0)) / (2M)) i^2 + \Omega_{j,q}^k(0) i + \eta_{j,q}    (17)

for 0 \le j < \lceil (P_k + 1)/2 \rceil and 0 \le i \le M,
where \eta_{j,q} = RU(-\pi, +\pi). In addition, the Ph_{j,q}(i) function is randomised at pitch intervals
(i.e. when the phase of the fundamental harmonic component is a multiple of 2\pi, i.e.
mod(phase_1(i), 2\pi) = 0).
Given the Ph_{j,q}^k(i), the random excitation component Res_k^r(i) is calculated as follows:

    Res_k^r(i) = \sum_{j=0}^{\lceil (P_k+1)/2 \rceil - 1} \sum_{q=0}^{NRS-1} C_{j,q}(i) \times MG_{j,q}^k(hv_j^k) \times \cos(Ph_{j,q}^k(i)),  0 \le i < M    (18)

where

    MG_{j,q}^k(hv_j^k) = MG_j^k / \sqrt{2 \times NRS}   for hv_j^k = 0 and 1 \le j < \lceil (P_k+1)/2 \rceil
    MG_{j,q}^k(hv_j^k) = 0                              for hv_j^k = 1 and 1 \le j < \lceil (P_k+1)/2 \rceil
    MG_{j,q}^k(hv_j^k) = 0                              otherwise, including j = 0

    C_{j,q}(i) = 0   if \Omega_{j,q}^k(i) > \pi
    C_{j,q}(i) = 1   otherwise
Thus for V_k=1 voiced coding frames, the mixed excitation residual is formed as:

    Res_k(i) = Res_k^h(i) + Res_k^r(i)    (19)

Notice that when V_k=0, instead of using Equation 5, the random excitation signal Res_k(i) can
be generated by the summation of random cosines located 50 Hz apart, where their phase is
randomised every \beta samples, with \beta < M, i.e.
    Res_k(i) = \sqrt{E_k / 40} \sum_{r=0}^{79} \cos(2\pi (50/f_s) r i + \eta_r(\lambda))    (20)

where \eta_r(\lambda) = RU(-\pi, +\pi), \lambda = 0,1,2,..., 0 \le i < M, and \lambda
is defined so as to ensure that the phase of the cos terms is randomised every \beta samples
across frame boundaries. The resulting Res_n(i) and Res_{n-1}(i) excitation sequences, see Figure
4, are processed by the corresponding 1/A_n(z) and 1/A_{n-1}(z) LPC synthesis filters. When
coding the next (n+1)th frame, 1/A_{n-1}(z) becomes 1/A_n(z) (including the memory) and
1/A_n(z) becomes 1/A_{n+1}(z) with the memory of 1/A_n(z). This is valid in all cases except
during an unvoiced to voiced transition, where the memory of the 1/A_{n+1}(z) filter is set to
zero. The coefficients of the 1/A_n(z) and 1/A_{n-1}(z) synthesis filters are calculated directly
from the nth and (n-1)th coding speech frames respectively, when the LPC analysis frame
size L is equal to M samples. However, when L \ne M (usually L > M) linear interpolation is used
on the filter coefficients (defined every L samples) so that the transfer function of the
synthesis filter is updated every M samples.

The output signals of these filters, denoted as X_{n-1}(i) and X_n(i), are weighted, overlapped and
added as schematically illustrated in Figure 5 to yield \hat{X}_n(i), i.e.:

    \hat{X}_n(i) = W_{n-1}(i) X_{n-1}(i) + W_n(i) X_n(i)

where

    W_n(i) = 0.54 - 0.46 \cos(2\pi i / (2M - 1))            for 0 \le i < M, when V_n = V_{n-1}

    W_n(i) = 0                                              for 0 \le i < 0.25M
    W_n(i) = 0.5 - 0.5 \cos(\pi (i - 0.25M) / (0.5M))       for 0.25M \le i < 0.75M   when V_n \ne V_{n-1}
    W_n(i) = 1                                              for 0.75M \le i < M
                                                                                      (21)
and

    W_{n-1}(i) = 0.54 - 0.46 \cos(2\pi (i + M - 0.5) / (2M - 1))   for 0 \le i < M, when V_n = V_{n-1}

    W_{n-1}(i) = 1                                          for 0 \le i < 0.25M
    W_{n-1}(i) = 0.5 + 0.5 \cos(\pi (i - 0.25M) / (0.5M))   for 0.25M \le i < 0.75M   when V_n \ne V_{n-1}
    W_{n-1}(i) = 0                                          for 0.75M \le i < M
                                                                                      (22)
\hat{X}_n(i) is then filtered via a PF(z) post filter and a high pass filter HP(z) to yield the speech
segment S'_n(i). PF(z) is the conventional post filter:

    PF(z) = (A_n(z/b) / A_n(z/c)) (1 - \mu z^{-1})    (23)

with b=0.5, c=0.8 and \mu = 0.5 k_1^n; k_1^n is the first reflection coefficient of the nth coding
frame. HP(z) is defined as:

    HP(z) = (b_1 - c_1 z^{-1}) / (1 - a_1 z^{-1})    (24)

with b_1 = c_1 = 0.9807 and a_1 = 0.961481.
In order to ensure that the energy of the recovered S(i) signal is preserved, as compared to
that of the \hat{X}(i) sequence, a scaling factor SC is calculated every LPC frame of L samples:

    SC_l = \sqrt{E_x / E_s}    (25)

where:

    E_x = \sum_{i=0}^{L-1} \hat{X}_l(i)^2   and   E_s = \sum_{i=0}^{L-1} S'_l(i)^2

SC_l is associated with the middle of the lth LPC frame as illustrated in Figure 6. The filtered
samples from the middle of the (l-1)th frame to the middle of the lth frame are then multiplied
by SC_l(i) to yield the final output of the system, S_l(i) = SC_l(i) x S'_l(i), where:

    SC_l(i) = SC_{l-1} W_{l-1}(i) + SC_l W_l(i),  0 \le i < L    (26)

and

    W_l(i) = 0.5 - 0.5 \cos(\pi i / (L - 1)),   0 \le i < L
    W_{l-1}(i) = 0.5 + 0.5 \cos(\pi i / (L - 1)),   0 \le i < L

The scaling process introduces an extra half LPC frame delay into the coding-decoding
process.
The above described energy scaling procedure operates on an LPC frame basis in contrast to
both the decoding and PF(z), HP(z) filtering procedures which operate on the basis of a frame
of M samples.
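A sketch of this per-LPC-frame scaling (Equations 25 and 26), with illustrative names and the previous frame's factor passed in explicitly, might look as follows:

    import numpy as np

    def scale_lpc_frame(x_hat, s_prime, sc_prev):
        # x_hat  : L samples of X(i); s_prime: L samples of S'(i)
        # sc_prev: SC_{l-1} from the previous LPC frame
        L = len(x_hat)
        sc = np.sqrt(np.sum(x_hat**2) / np.sum(s_prime**2))  # Equation (25)
        i = np.arange(L)
        w_l = 0.5 - 0.5 * np.cos(np.pi * i / (L - 1))        # rising window
        w_lm1 = 0.5 + 0.5 * np.cos(np.pi * i / (L - 1))      # falling window
        sc_i = sc_prev * w_lm1 + sc * w_l                    # Equation (26)
        return sc_i * s_prime, sc                            # scaled frame, SC_l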
Details of the coding processes represented in Figure 1 will now be described.

Process I derives a voiced/unvoiced (V/UV) classification V_n for the nth input coding frame
and also assigns a pitch estimate P_n to the middle sample M_n of this frame. This process is
illustrated in Figure 7.
The V/UV and pitch estimation analysis frame is centred at the middle M_{n+1} of the (n+1)th
coding frame with 237 samples on either side. The signal x(i) in the above analysis frame is
low pass filtered with a cut off frequency f_c = 1.45 KHz and the resulting (-147, 147) samples
centred about M_{n+1} are used in a pitch estimation algorithm, which yields an estimate PM_{n+1}.
The pitch estimation algorithm is illustrated in Figure 8, where P represents the output of the
pitch estimation process. The 294 input samples are used to calculate a crosscorrelation
function CR(d), where d is shown in Figure 9 and 20 \le d \le 147. Figure 9 shows the two speech
segments which participate in the calculation of the crosscorrelation function value at "d"
delay. In particular, for a given value of d, the crosscorrelation function \rho_d(j) is calculated
for the segments {x_L}_d, {x_R}_d as:

    \rho_d(j) = \sum_{i=0}^{d-j-1} (x_L^d(i) - \bar{x}_L)(x_R^d(i) - \bar{x}_R)
                / \sqrt{ \sum_{i=0}^{d-j-1} (x_L^d(i) - \bar{x}_L)^2 \sum_{i=0}^{d-j-1} (x_R^d(i) - \bar{x}_R)^2 }    (27)

where:
x_L^d(i) = x(M_{n+1} - d + j + i), x_R^d(i) = x(M_{n+1} + j + i), for 0 \le i \le d-j-1, j=0,1,...,f(d). Figure 10
schematically represents the x_L^d and x_R^d speech segments used in the calculation of the
value CR(d) and the non linear relationship between d and f(d) is given in Figure 11. \bar{x}_L and
\bar{x}_R represent the mean values of the {x_L}_d and {x_R}_d sequences respectively.
The algorithm then selects max_j[\rho_d(j)] and defines CR(d) = max_j[\rho_d(j)], 20 \le d \le 147.
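A direct, unoptimised sketch of Equation (27) and the maximisation over j follows; f(d) is passed in as a parameter since its exact form is given only graphically in Figure 11, and the names are illustrative:

    import numpy as np

    def cr_value(x, mid, d, f_d):
        # x  : low pass filtered speech samples; mid: index of M_{n+1}
        # d  : candidate delay (20..147); f_d: maximum shift f(d)
        best = -1.0
        for j in range(f_d + 1):
            n = d - j                                   # segment length d - j
            xl = x[mid - d + j : mid - d + j + n].astype(float)
            xr = x[mid + j : mid + j + n].astype(float)
            xl -= xl.mean()
            xr -= xr.mean()
            denom = np.sqrt(np.sum(xl**2) * np.sum(xr**2))
            if denom > 0.0:
                best = max(best, float(np.sum(xl * xr) / denom))
        return best                                     # CR(d) = max_j rho_d(j)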
In addition to CR(d), the box in Figure 8 labelled "Calculation of CR function and selection
of its peaks", whose detailed diagram is shown in Figure 12, also provides the locations loc(k)
of the peaks of the CR(d) function, where k=1,2,...,Np and Np is the number of peaks in a
CR(d) function.
Figure 12 is a block diagram of the process involving the calculation of the CR function and
the selection of its peaks. As illustrated, given CR(d), a threshold th(d) is determined as:

    th(d) = CR(d_{max}) - b - (d - d_{max}) \times a - c    (28)

where c = 0.08 when [(V'_n = 1) AND (|d_{max} - P'_n| < 0.15 \times (d_{max} + P'_n)) OR (V_{n-1} = 1)]
AND (d > 0.875 \times P'_n) AND (d < 1.125 \times P'_n),
or c = 0 elsewhere,
and the constants a and b are defined as:

    b        0.025     0.04      0.05
    a        0.0005    0.0005    0.0006
    V_n      1         1         0
    V_{n-1}  1         0         1/0

d_{max} is equal to the value of d for which CR(d) is maximised to CR_{max}. Using this threshold
the CR(d) function is clipped to CR_L(d), i.e.

    CR_L(d) = 0       for CR(d) \le th(d)
    CR_L(d) = CR(d)   otherwise.

CR_L(d) contains segments G_s, s=1,2,3,..., of positive values separated by G_o runs of zero
values. The algorithm examines the length of the G_o runs which exist between successive G_s
segments (i.e. G_s and G_{s+1}), and when G_o < 17, then the G_s segment with the max CR_L(d)
value is kept. This procedure yields CR_L(d), which is then examined by the following "peak
picking" procedure. In particular those CR_L(d) values are selected for which:

    CR_L(d) > CR_L(d-1) and CR_L(d) \ge CR_L(d+1)

However certain peaks can be rejected if:

    CR_L(loc(k)) < CR_L(loc(k+1)) \times 0.9

This ensures that the final CR_L(loc(k)), k=1,...,Np, does not contain spurious low level
CR_L(d) peaks. The locations d of the above defined CR_L(d) peaks are given by loc(k),
k=1,2,...,Np.
CR(d) and loc(k) are used as inputs to the following Modified High Resolution Pitch
Estimation algorithm (MHRPE) shown in Figure 8, whose output is PM_{n+1}. The flowchart of
this MHRPE procedure is shown in Figure 13, where P is initialised with 0 and, at the end,
the estimated P is the requested PM_{n+1}. In Figure 13 the main pitch estimation procedure is
based on a Least Squares Error (LSE) algorithm which is defined as follows:
For each possible pitch value j in the range from 21 to 147 with an increment of 0.1 x j, i.e.
j \in {21,23,25,27,30,33,36,40,44,48,53,58,64,70,77,84,92,101,111,122,134} (thus 21 iterations
are performed):
1) Form the multiplication factor vector:

    u_j = [loc / j]

where [.] denotes rounding to the nearest integer.
2) Reject possible pitch j and go back to (1) if
a) the same element occurs in u_j twice;
b) the elements of u_j have as a common factor a prime number.
3) Form the following error quantity:

    E_j = loc^T loc - 2 p_j u_j^T loc + p_j^2 u_j^T u_j

where

    p_j = (u_j^T loc) / (u_j^T u_j)

4) Select the p_{js} value for which the associated error quantity E_{js} is minimum
(i.e. E_{js} \le E_j for all j \in {21,23,...,134}). Set P = p_{js}.
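The following sketch restates this LSE fit in Python; it renders steps 1 to 4 in simplified form (the candidate grid is taken from the list above, and the guard against zero factors is an added assumption), so it is illustrative rather than a definitive rendering of Figure 13:

    import numpy as np
    from math import gcd
    from functools import reduce

    def lse_pitch(loc, candidates=(21, 23, 25, 27, 30, 33, 36, 40, 44, 48,
                                   53, 58, 64, 70, 77, 84, 92, 101, 111, 122, 134)):
        # loc: ascending array of CR(d) peak locations loc(1..Np)
        loc = np.asarray(loc, dtype=float)
        best_p, best_e = 0.0, np.inf
        for j in candidates:
            u = np.rint(loc / j)                        # multiplication factor vector
            if u.min() < 1 or len(set(u)) < len(u):     # repeated (or zero) element
                continue
            if reduce(gcd, u.astype(int).tolist()) > 1: # common factor
                continue
            p = float(loc @ u) / float(u @ u)           # optimum period for this u
            e = float(loc @ loc) - 2 * p * float(u @ loc) + p * p * float(u @ u)
            if e < best_e:
                best_p, best_e = p, e
        return best_p, best_e                           # P (0 if all rejected), E_js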
The next two general conditions, "Reject Highest Delay" loc(Np) and "Reject Lowest Delay"
loc(1), are included in order to reject false pitch, "double" or "half" values and in general to
provide constraints on the pitch estimates of the system. The "Reject Highest Delay" condition
involves 3 conditions:
i) if P=0 then reject loc(Np).
ii) if loc(Np) > 100 then find the local maximum CR(d_{lm}) in CR(d) in the vicinity of the
estimated pitch P (i.e. 0.8 x P to 1.2 x P) and compare this with th(d_{lm}), which is determined
as in Equation 28. Reject loc(Np) when CR(d_{lm}) < th(d_{lm}) - 0.02.
iii) If the error E_{js} of the LSE algorithm is larger than 50 and u_{js}(Np) = Np with Np > 2 then
reject loc(Np).
The flowchart of this is given in Figure 14.
The "ReJect Lowest Delay" general condition, whose flowchart is given in Figure 15, rejects
loc(l) when the following three constraints are ~imllit~ ously satisfied:
i) The density of detection of the peaks of the correlation coefficient function is less than or
e~l to 0.75. i.e.
u" ( Np)
ii) If the loc?tion of the first peak is neglected (i,e, Ioc(1)), then the r~m~inin~ locations
exhibit a common factor.
iti) The value of the corrolation coefficient function at the locations of the mi.~sin~ peaks is
relatively small compared to ad~acent ~ tected peaks, i,e,
If Upnk-upn~k)>l, for k=l "..Np. then
for i=upn(k)+ I : Upn(k+ l )- I
a) find local maximum CR(d"n) in the range from (i-O l)xloc(l) to (i+O.l)x
loc(l ).
b) if CR(d~m) <0 97xCR(upn(k)) then Reject I ,owest nelay, END.
else Continuc

This concludes the pitch estimation procedure of Figure 7 whose output is PM_{n+1}. As is also
illustrated in Figure 7 however, in parallel to the pitch estimation, Process I obtains 160
samples centred at the middle of the M_{n+1} coding frame, removes their mean value, and then
calculates R0, R1 and the average R_{av} of the energies of the previous K non-silence coding
frames. K is fixed to 50 for the first 50 non-silence coding frames, increases from 50 to 100
with the next 50 non-silence coding frames, and then remains constant at the value of 100.
The flowchart of the procedure that calculates R_{av}, R1, R0 and updates the R_{av} buffer is
shown in Figure 16, where "Count" represents the number of non-silence speech frames, and
"++" denotes increase by one. Notice that TH is an adaptive threshold that is representative
of a silence (non speech) frame and is defined as in Figure 17. CR in this case is equal to
CR_{max}.
Given R0, R1, R_{av} and CR_{max}, the V/UV part of Process I calculates the status VM_{n+1} of the
(n+1)th frame. The flowchart of this part of the algorithm is shown in Figure 18 where "V"
represents the output V/UV flag of this procedure. Setting the "V" flag to 1 or 0 indicates
voiced or unvoiced classification respectively. The "CR" parameter denotes the maximum
value of the CR function which is calculated in the pitch estimation process. A diagrammatic
representation of the voiced/unvoiced procedure is given in Figure 19.
Having the VM_{n+1} value, the PM_{n+1} estimate and the V'_n and P'_n estimates which have been
produced from Process I operating on the previous nth coding frame, as illustrated in Figure
7, part b, two further locations M_{n+1}+d1 and M_{n+1}+d2 are estimated and the corresponding
(-147, 147) segments of filtered speech samples are obtained as illustrated in Figure 7, part b.
These additional two analysis frames are used as input to the "Pitch Estimation process" of

Figure 8 to yield PM_{n+1+d1} and PM_{n+1+d2}. The procedure for calculating d1 and d2 is given in the
flowchart of Figure 20.
The final step in part (a) of Process I of Figure 7 evolves the previous V/UV classification
procedure of Figure 8 with inputs R0, R1, R_{av}, and

    CP = max[CR_{max}, CR_{M_{n+1}+d1}, CR_{M_{n+1}+d2}]

to yield a preliminary value V''_{n+1}.
In addition, a multipoint pitch estimation algorithm accepts PM_{n+1}, PM_{n+1+d1}, PM_{n+1+d2}, V_{n-1},
P_{n-1}, V'_n, P'_n to provide a preliminary pitch value P''_{n+1}. The flowchart of this multipoint pitch
estimation algorithm is given in Figure 21, where P1, P2 and P0 represent the pitch estimates
associated with the M_{n+1}+d1, M_{n+1}+d2 and M_{n+1} points respectively, and P denotes the output
pitch estimate of the process, that is P''_{n+1}.
Finally, part (b) of Process I of Figure 7 imposes constraints on the V''_{n+1} and P''_{n+1} estimates in
order to ensure a smooth evolution for the pitch parameter. The flowchart of this section is
given in Figure 22. At the start of this process "V" and "P" represent the voicing flag and
pitch estimate values before constraints are applied (V''_{n+1} and P''_{n+1} in Figure 7) whereas at
the end of the process "V" and "P" represent the voicing flag and pitch estimate values after
the constraints have been applied (V'_{n+1} and P'_{n+1}). The V'_{n+1} and P'_{n+1} produced from this
section are then used in the next pitch post processing section together with V_{n-1}, V'_n, P_{n-1} and
P'_n to yield the final voiced/unvoiced and pitch estimate parameters V_n and P_n for the nth
coding frame. This pitch post processing stage is defined in the flowchart of Figures 23, 24
and 25, the output A of Figure 23 being the input to Figure 24, and the output B of Figure 24
being the input to Figure 25. At the start of this procedure "P_n" and "V_n" represent the pitch
estimate and voicing flag respectively, which correspond to the nth coding frame prior to post
processing (i.e. P'_n, V'_n) whereas at the end of the procedure "P_n" and "V_n" represent the
final pitch estimate and voicing flag associated with the nth frame (i.e. P_n, V_n).

The LPC analysis process (Process II of Figure 1) can be performed using the
Autocorrelation, Stabilised Covariance or Lattice methods. The Burg algorithm was used,
although simple autocorrelation schemes could be employed without a noticeable effect on the
decoded speech quality. The LPC coefficients are then transformed to an LSP representation.
Typical values for the number of coefficients are 10 to 12 and a 10th order filter has been
used. LPC analysis processes are well known and described in the literature, for example
"Digital Processing of Speech Signals", L.R. Rabiner and R.W. Schafer, Prentice-Hall Inc.,
Englewood Cliffs, New Jersey, 1978. Similarly, LSP representations are well known, for
example from "Line Spectrum Pair and Speech Data Compression", F. Soong and B.H. Juang,
Proc. ICASSP-84, pp 1.10.1-1.10.4, 1984. Accordingly these processes and representations
will not be described further in this document.
In Process II, ten LSP coefficients are used to represent the data. These 10 coefficients could
be quantised using scalar quantisation with 37 bits and the following bit allocation pattern
[3,4,4,4,4,4,4,4,3,3]. This is a relatively simple process, but the resulting bit rate of 1850
bits/second is unnecessarily high. Alternatively the LSP coefficients can be Vector Quantised
(VQ) using a Split-VQ technique. In the Split-VQ technique an LSP parameter vector of
dimension "p" is split into two or more subvectors of lower dimensions and then each
subvector is Vector Quantised separately (when Vector Quantising the subvectors a direct VQ
approach is used). In effect, the LSP transformed coefficient vector C, which consists of "p"
consecutive coefficients (c_1,c_2,...,c_p), is split into "K" vectors C_k (1 \le k \le K), with the
corresponding dimensions d_k (1 \le d_k \le p), p = d_1+d_2+...+d_K. In particular, when "K" is
set to "p" (i.e. when C is partitioned into "p" elements) the Split-VQ becomes equivalent to
Scalar Quantisation. On the other hand, when K is set to unity (K=1, d_k=p) the Split-VQ
becomes equivalent to Full Search VQ.
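A minimal sketch of this Split-VQ search follows, using a plain Euclidean distance and placeholder codebooks for illustration (the weighted distance actually used by Process III is described next):

    import numpy as np

    def split_vq(lsp, codebooks):
        # lsp      : array of p LSP coefficients
        # codebooks: list of K arrays, codebooks[k] of shape (cbs_k, d_k),
        #            with d_1 + ... + d_K = p
        indices, parts, start = [], [], 0
        for cb in codebooks:
            d = cb.shape[1]
            sub = lsp[start:start + d]
            i = int(np.argmin(np.sum((cb - sub) ** 2, axis=1)))
            indices.append(i)
            parts.append(cb[i])
            start += d
        return indices, np.concatenate(parts)   # indices sent, quantised vector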
The above Split-VQ approach leads to an LPC filter bit rate of the order of 1.3 to
1.4 Kbits/sec. In order to minimise further the bit rate of the voice coded system described in
this document a Split Matrix VQ (SMQ) has been developed in the University of Manchester
and reported in "Efficient Coding of LSP Parameters using Split Matrix Quantisation",
C. Xydeas and C. Papanastasiou, Proc. ICASSP-95, pp 740-743, 1995. This method results in
transparent LPC quantisation at 900 bits/sec and offers a flexible way to obtain, for a given
quantisation accuracy, the required memory/complexity characteristics for Process III. An
important feature of SMQ is a new weighted Euclidean distance which is defined in detail as
follows.

    D(L_k(l), \hat{L}_k(l)) = \sum_{t=0}^{N-1} [ \sum_{s=1}^{m(k)} (LSP_{S(k-1)+s}^{l+t} - \hat{LSP}_{S(k-1)+s}^{l+t})^2 w_f(s,t)^2 w_e(t)^2 ]    (29)

where \hat{L}_k(l) represents the kth (k=1,...,K) quantised submatrix and \hat{LSP} are its
elements. m(k) represents the spectral dimension of the kth submatrix and N is the SMQ
frame dimension. Note also that:

    S(k) = \sum_{j=0}^{k-1} m(j),  m(0) = 1  and  \sum_{k=1}^{K} m(k) = p

    w_e(t) = (En(t) / Aver(En))^{\alpha}   for transmission frames 0 \le t \le N-1,
             when the N LPC frames consist of both voiced and unvoiced frames
    w_e(t) = En(t)^{\alpha_1}              otherwise    (30)

where E_r(t) is the normalised energy of the prediction error of the (l+t)th frame, En(t) is the
RMS value of the (l+t)th speech frame and Aver(En) is the average RMS value of the N LPC
frames used in SMQ. The values of the constants \alpha and \alpha_1 are set to 0.2 and 0.15
respectively.
Also:

    w_f(s,t) = P(LSP_{S(k-1)+s}^{l+t})^{\beta}    (31)

where P(LSP_{S(k-1)+s}^{l+t}) is the value of the power envelope spectrum of the (l+t)th speech
frame at the LSP_{S(k-1)+s}^{l+t} frequency. \beta is equal to 0.15.
The overall SMQ quantisation process that yields the quantised LSP coefficient vectors \hat{l}^l
to \hat{l}^{l+N-1} for the l to l+N-1 analysis frames is shown in Figure 26. This figure also includes
the inverse process, which accepts the above \hat{l}^{l+i} vectors, i=0,...,N-1, and provides the
corresponding LPC coefficient vectors \hat{a}^l to \hat{a}^{l+N-1}. The a^{l+i}, i=0,...,N-1, coefficient
vectors are modified, prior to the LPC to LSP transformation, by a 10 Hz bandwidth
expansion as indicated in Figure 26. A 5 Hz bandwidth expansion is also included in the
inverse quantisation process.
Process IV of Figure 1 will now be described. This process is concerned with the mixed
voiced classification of harmonics. When the nth coding frame is classified as voiced, the
residual signal R_n(i) of length 160 samples centred at the middle M_n of the nth coding frame
and the pitch period P_n for that frame are used to determine the strongly voiced
(hv_j=1)/weakly voiced (hv_j=0) classification associated with the jth harmonic \omega_j^n. The
flowchart of Process IV is given in Figure 27. The R_n array of 160 samples is Hamming
windowed and augmented to form a 512 size array, which is then FFT processed. The
maximum and minimum values MGR_{max}, MGR_{min} of the resulting 256 spectral magnitude
values are determined, and a threshold TH0 is calculated. TH0 is then used to clip the
magnitude spectrum. The clipped MGR array is searched to define peaks MGR(P) satisfying:

    MGR(P) > MGR(P+1) and MGR(P) > MGR(P-1)

For each peak MGR(P), "supported" by the MGR(P+1) and MGR(P-1) values, a second order
polynomial is fitted and the maximum point of this curve is accepted as MGR(P) with a
location loc(MGR(P)). Further constraints are then imposed on these magnitude peaks. In
particular peaks are rejected:

a) if there are spectral peaks in the neighbourhood of loc(MGR(P)) (i.e. in the range
loc(MGR(P))-fo/2 to loc(MGR(P))+fo/2, where fo is the fundamental frequency in Hz)
whose value is larger than 80% of MGR(P), or
b) if there are any spectral magnitudes in the same range whose value is larger than MGR(P).
After applying these two constraints the remaining spectral peaks are characterised as
"dominant" peaks. The objective of the remaining part of the process is to examine if there is
a "dominant" peak near a given harmonic j x \omega_0, in which case the harmonic is classified as
strongly voiced and hv_j=1, otherwise hv_j=0. In particular, two thresholds are defined as
follows:

    TH1 = 0.15 x fo,  TH2 = (1.5/P_n) x fo

with fo = (1/P_n) x f_s, where f_s is the sampling frequency.
The difference (loc(MGR_d(k)) - loc(MGR_d(k-1))) is compared to 1.5 x fo + TH2, and if
larger a related harmonic is not associated with a "dominant" peak and the corresponding
classification hv is zero (weakly voiced). loc(MGR_d(k)) is the location of the kth dominant
peak and k=1,...,D where D is the number of dominant peaks. This procedure is described in
detail in Figure 28, in which it should be noted that the harmonic index j does not always
correspond to the magnitude spectrum peak index k, and loc(k) is the location of the kth
dominant peak, i.e. loc(MGR_d(k)) = loc(k).
In order to minimise the bit rate associated with the transmission of the hv_j information, two
schemes have been employed which coarsely represent hv.
Scheme I
The spectrum is divided into bands of 500 Hz each and a strongly voiced/weakly voiced flag
Bhv is assigned for each band. The first and last 500 Hz bands, i.e. 0 to 500 and 3500 to

4000 Hz, are always regarded as strongly voiced (Bhv=1) and weakly voiced (Bhv=0)
respectively. When V_n=1 and V_{n-1}=1 the 500 to 1000 Hz band is classified as voiced, i.e.
Bhv=1. Furthermore, when V_n=1 and V_{n-1}=0 the 3000 to 3500 Hz band is classified as
weakly voiced, i.e. Bhv=0. The Bhv values of the remaining 5 bands are determined using a
majority decision rule on the hv_j values of the j harmonics which fall within the band under
consideration. When the number of harmonics for a given band is even and no clear majority
can be established, i.e. the number of harmonics with hv_j=1 is equal to the number of
harmonics with hv_j=0, then the value of Bhv for that band is set to the opposite of the value
assigned to the immediately preceding band. At the decoding process the hv_j of a specific
harmonic j is equal to the Bhv value of the corresponding band. Thus the hv information may
be transmitted with 5 bits.
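A sketch of this majority decision rule for a single band, including the tie-break, is given below (the names are illustrative):

    def band_flag(hv_in_band, prev_bhv):
        # hv_in_band: 0/1 hv_j values of the harmonics falling in the band
        # prev_bhv  : Bhv assigned to the immediately preceding band
        ones = sum(hv_in_band)
        zeros = len(hv_in_band) - ones
        if ones == zeros:            # even count, no clear majority
            return 1 - prev_bhv      # opposite of the preceding band
        return 1 if ones > zeros else 0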
Scheme II
In this case the 680 Hz to 3400 Hz range is represented by only two variable size bands.
When V_n=1 and V_{n-1}=0 the Fc frequency that separates these two bands can be one of the
following:
(A) 680, 1360, 2040, 2720,
whereas, when V_n=1 and V_{n-1}=1, Fc can be one of the following frequencies:
(B) 1360, 2040, 2720, 3400.
Furthermore, the 0 to 680 and 3400 to 4000 Hz bands are always represented with Bhv=1 and
Bhv=0 respectively. The Fc frequency is selected by examining the three bands sequentially
defined by the frequencies in (A) or (B) and by using again a majority rule on the harmonics
which fall within a band. When a band with a mixed voiced classification Bhv=0 is found, i.e.
the number of harmonics with hv_j=0 is larger than the number of harmonics with hv_j=1,
then Fc is set to the lower boundary of this band and the remaining spectral region is
classified as Bhv=0. In this case only 2 bits are allocated to define Fc. The lower band is
strongly voiced with Bhv=1, whereas the higher band is weakly voiced with Bhv=0.

To illustrate the effect of the mixed voice classification on the speech synthesised from the
transmitted information, Figures 29 and 30 represent respectively an original speech
waveform obtained for the utterance "Industrial shares were mostly a" and frequency tracks
obtained for that utterance. The horizontal axis represents time in terms of frames each of
20 msec duration. Figure 31 shows to a larger scale a section of Figure 30, and represents
frequency tracks by full lines for the case when the voiced frames are all deemed to be
strongly voiced (hv=1) and by dashed lines when the strongly/weakly voiced classification is
taken into account so as to introduce random perturbations when hv=0.
Figure 32 shows four waveforms A, B, C and D. Waveform A represents the magnitude
spectrum of a speech segment and the corresponding LPC spectral envelope (log10 domain).
Waveforms B, C and D represent the normalised Short-Term magnitude spectra of the
corresponding residual segment (B), the excitation segment obtained using the binary
(voiced/unvoiced) excitation model (C), and the excitation segment obtained using the
strongly voiced/weakly voiced/unvoiced hybrid excitation model (D). It will be noted that
the hybrid model introduces an appropriate amount of randomness where required in the 3\pi/4
to \pi range such that curve D is a much closer approximation to curve B than curve C.
Process V of Figure 1 will now be described. Once the residual signal has been derived, a
segment of P_n samples is obtained in the residual signal domain. The magnitude spectrum of
the segment, which contains excitation source information, is derived by applying a P_n points
DFT. An alternative solution, in order to avoid the computational complexity of the P_n points
DFT, is to apply a fixed length FFT (128 points) and to find the value of the magnitude
spectrum at the desired points, using linear interpolation.
For a real-valued sequence x(i) of P points the DFT may be expressed as:

    X(k) = \sum_{i=0}^{P-1} x(i) \cos(2\pi k i / P) - j \sum_{i=0}^{P-1} x(i) \sin(2\pi k i / P)

The P_n point DFT will yield a double-side spectrum. Thus, in order to represent the excitation
signal as a superposition of sinusoidal signals, the magnitude of all the non DC components
must be multiplied by a factor of 2. The total number of single side magnitude spectrum
values, which are used in the reconstruction process, is equal to \lceil (P + 1)/2 \rceil.
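For illustration, the single-side magnitude extraction of Process V can be sketched as follows, using an FFT as the DFT (the function name is illustrative):

    import numpy as np

    def pitch_segment_magnitudes(segment):
        # segment: P residual samples of one pitch period
        P = len(segment)
        n_mag = (P + 2) // 2              # ceil((P + 1) / 2) values
        X = np.fft.fft(segment)           # P-point DFT
        mg = np.abs(X[:n_mag])
        mg[1:] *= 2.0                     # double the non-DC components
        return mg                         # MG_0 .. MG_{ceil((P+1)/2)-1}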
Process VI of Figure 1 will now be described. The DFT (Process V) applied on the P_n
samples of a pitch segment in the residual domain yields \lceil (P_n + 1)/2 \rceil spectral magnitudes
(MG_j^n, 0 \le j < \lceil (P_n + 1)/2 \rceil) and \lceil (P_n + 1)/2 \rceil phase values. The phase information is
neglected. However, the continuity of the phase between adjacent voiced frames is preserved.
Moreover, the contribution of the DC magnitude component is assumed to be negligible and
thus MG_0 is set to 0. In this way, the non-DC magnitude spectrum is assumed to contain all
the perceptually important information.
Based on the assumption of an "approximately" flat shape magnitude spectrum for the pitch
residual segment, various methods could be used to represent the entire magnitude spectrum
with a single value. Specifically, a modified single value spectral amplitude representation
(MSVSAR) technique is described below.
MSVSAR is based on the observation that some of the speech spectrum resonance and anti-
resonance information is also present in the residual magnitude spectrum (G.S. Kang and S.S.
Everett, "Improvement of the Excitation Source in the Narrow-Band Linear Prediction
Vocoder", IEEE Trans. Acoust., Speech and Signal Proc., Vol. ASSP-33, pp.377-386, 1985).
LPC inverse filtering can not produce a residual signal of absolutely flat magnitude spectrum,
mainly due to: a) the "cascade representation" of formants by the LPC filter 1/A(z), which
results in the magnitudes of the resonant peaks being dependent upon the pole locations of the
1/A(z) all-pole filter and b) the LPC quantisation noise. As a consequence, the LPC residual
signal is itself highly intelligible. Based on this observation the MG_j magnitudes are
obtained by spectral sampling at the harmonic locations \omega_j^n, 0 \le j < \lceil (P_n + 1)/2 \rceil, of a
modified LPC synthesis filter, that is defined as follows:

    MP(z) = GN / (1 - GR \sum_{i=1}^{p} \hat{a}_i^n z^{-i})    (32)

where \hat{a}_i^n, i=1,...,p, represent the p quantised LPC coefficients of the nth coding frame and
GR and GN are defined as follows:

    GR = \bar{GR} \times \prod_{i=1}^{p} (1 - K_i^n)    (33)

and

    GN = \sqrt{ ((1/(2P_n)) \sum_{i=0}^{2P_n - 1} x_{lm}^n(i)^2) / (\sum_{j=1}^{\lceil (P_n+1)/2 \rceil - 1} (MP(\omega_j^n) H(\omega_j^n))^2 / 2) }    (34)

where K_i^n, i=1,...,p, are the reflection coefficients of the nth coding frame, x_{lm}^n(i) represents a
sequence of 2P_n speech samples centred in the middle of the nth coding frame from which the
mean value is calculated and removed, and MP(\omega_j^n) and H(\omega_j^n) represent the frequency
responses of the MP(z) and 1/A(z) filters respectively at the \omega_j^n frequency. Notice that the
MP(\omega_j^n) values are calculated assuming GN=1. The \bar{GR} parameter represents a constant
whose value is set to 0.25.
Equation 32 defines a modified LPC synthesis filter with reduced feedback gain, whose
frequency response consists of nearly equalised resonant peaks, the locations of which are
very close to the LPC synthesis resonant locations. Furthermore, the value of the feedback
gain GR is controlled by the performance of the LPC model (i.e. it is proportional to the
normalised LPC prediction error). In addition Equation 34 ensures that the energy of the
reproduced speech signal is equal to the energy of the original speech waveform. Robustness
is increased by computing the speech RMS value over two pitch periods.

Two alternative magnitude spectrum representation techniques are described below, which
allow for better coding of the magnitude information and lead to a significant improvement in
reconstructed speech quality.
The first of the alternative magnitude spectrum representation techniques is referred to
below as the "Na amplitude system". The basic principle of this MG_j^n quantisation system is
to represent accurately those MG_j^n values which correspond to the Na largest speech Short
Term (ST) spectral envelope values. In particular, given the LPC coefficients of the nth
coding frame, the ST magnitude spectrum envelope is calculated (i.e. sampled) at the
harmonic frequencies \omega_j^n and the locations lc(j), j=1,...,Na, of the largest Na spectral samples
are determined. These locations indicate effectively which of the \lceil (P_n + 1)/2 \rceil - 1 MG_j^n
magnitudes are subjectively more important for accurate quantisation. The system
subsequently selects MG_j^n, j=lc(1),...,lc(Na), and Vector Quantises these values. If the
minimum pitch value is 17 samples, the number of non-DC MG_j^n amplitudes is equal to 8
and for this reason Na \le 8. Two variations of the "Na-amplitudes system" were developed with
equivalent performance and their block diagrams are depicted in Figure 33 (a) and (b)
respectively.
i) Na-amplitudes system with Mean Normalisation Factor. In this variation, a pitch segment
of P_n residual samples R_n(i), centred about the middle M_n of the nth coding frame, is
obtained and DFT transformed. The mean value of the spectral magnitudes MG_j^n, j=1,...,
\lceil (P_n + 1)/2 \rceil - 1, is calculated as:

    m = \sum_{j=1}^{\lceil (P_n+1)/2 \rceil - 1} MG_j^n / (\lceil (P_n + 1)/2 \rceil - 1)    (35)

m is quantised and then used as the normalisation factor of the Na selected amplitudes MG_j^n,
j=lc(1),...,lc(Na). The resulting Na amplitudes are then vector quantised to \hat{MG}_j^n.
ii) Na-amplitudes system with RMS Normalisation Factor. In this variation the RMS value
of the pitch segment centred about the middle M_n of the nth coding frame is calculated as:

    g = \sqrt{ (1/P_n) \sum_{i=0}^{P_n - 1} R_n(i)^2 }    (36)

g is quantised and then used as the normalisation factor of the Na selected amplitudes MG_j^n,
j=lc(1),...,lc(Na). These normalised amplitudes are then Vector Quantised to \hat{MG}_j^n. Notice
that the P_n points DFT operation can be avoided in this case, since the magnitude spectrum of
the pitch segment is calculated only at the Na selected harmonic frequencies \omega_j^n,
j=lc(1),...,lc(Na).
In both cases the quantisation of the m and g factors, used to normalise the MG_j^n values, is
performed using an adaptive \mu-law quantiser with a non-linear characteristic:

    \log(1 + \mu |A| / A_{max}) / \log(1 + \mu) \times sgn(A),  with \mu = 255    (37)

This arrangement for the quantisation of g or m extends the dynamic range of the coder to not
less than 25 dBs.
At the receiver end the decoder recovers the MG_j^n magnitudes as \tilde{MG}_j^n = \hat{MG}_j^n \times \hat{A},
j=lc(1),...,lc(Na). The remaining \lceil (P_n + 1)/2 \rceil - Na - 1 MG_j^n values are set to a constant
value \hat{A} (where A is either "m" or "g"). The block diagram of the adaptive \mu-law quantiser is
shown in Figure 34.
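A sketch of the \mu-law companding used for m and g follows; the uniform coding of the companded value and the bit count stand in for the adaptive quantiser of Figure 34 and are assumptions:

    import numpy as np

    def mu_law_quantise(a, a_max, bits=5, mu=255.0):
        # a: the m or g normalisation factor; a_max: current adaptive maximum
        levels = 2 ** bits
        y = np.log(1 + mu * abs(a) / a_max) / np.log(1 + mu) * np.sign(a)
        code = int(round((y + 1) / 2 * (levels - 1)))   # uniform code of y in [-1, 1]
        y_hat = 2 * code / (levels - 1) - 1
        a_hat = np.sign(y_hat) * (a_max / mu) * ((1 + mu) ** abs(y_hat) - 1)
        return code, a_hat                              # index sent, decoded factor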

The second of the alternative magnitude spectrum representation techniques is referred to
below as the "Variable Size Spectral Vector Quantisation (VS/SVQ)" system. Coding
systems which employ the general synthesis formula of Equation (1) to recover speech
encounter the problem of coding a variable length, pitch dependent spectral amplitude vector
MG. The "Na-amplitudes" MG_j^n quantisation schemes described in Figure 33 avoid this
problem by Vector Quantising the minimum expected number of spectral amplitudes and by
setting the rest of the MG_j^n amplitudes to a fixed value. However, such a partially spectrally
flat excitation model has limitations in providing high recovered speech quality. Thus, in
order to improve the output speech quality, the shape of the entire {MG_j^n} magnitude
spectrum should be quantised. Various techniques have been proposed for coding {MG_j^n}.
Originally ADPCM has been used across the MG_j^n values associated with a specific coding
frame. Also {MG_j^n} has been DCT transformed and coded differentially across successive
MG_j^n magnitude spectra. However, these coding schemes are rather inefficient and operate
with relatively high bit rates. The introduction of Vector Quantisation on the {MG_j^n} spectral
amplitude vectors allowed for the development of Sinusoidal and Prototype Interpolation
systems which operate at around 2.4 Kbits/sec. Two known {MG_j^n} VQ methods are
described below which quantise a variable size (vs_n) input vector with a fixed size (fxs)
codevector.
i) The first VQ method involves the transformation of the input vector to a fixed size vector
followed by conventional Vector Quantisation. The inverse transformation on the quantised
fixed size vector yields the recovered quantised MG^n vector. Transformation techniques
which have been used include Linear Interpolation, Band Limited Interpolation, All Pole
modelling and Non-Square transformation. However, the overall distortion produced by this
approach is the summation of the VQ noise and a component which is introduced by the
transformation process.
ii) The second VQ method achieves the direct quantisation of a variable size input vector with
a fixed size code vector. This is based on selecting only vs_n elements from each codebook
vector, to form a distortion measure between a codebook vector and an input MG^n vector.
Such a quantisation approach avoids the transformation distortion of the previous techniques
mentioned in (i) and results in an overall distortion that is equal to the Vector Quantisation
noise.
An improved VQ method will now be described which is referred to below as the Variable
Size Spectral Vector Quantisation (VS/SVQ) scheme. This scheme was developed to take
advantage of the underlying principle that the actual shape of the {MG_j^n} magnitude
spectrum is defined by a minimum of \lceil (P_n + 1)/2 \rceil equally spaced samples. If we consider
the maximum expected pitch estimate P_{max}, then any {MG_j^n} spectral shape can be
represented adequately by \lceil (P_{max} + 1)/2 \rceil samples. This suggests that the fixed size fxs of the
codebook vectors S^i representing the MG_j shapes should not be larger than \lceil (P_{max} + 1)/2 \rceil.
Of course this also implies that given the \lceil (P_{max} + 1)/2 \rceil samples of a codebook vector, the
complete spectral shape, defined at any frequency, is obtained via an interpolation process.
Figure 35 highlights the VS/SVQ process. The codebook CBS, having cbs fixed fxs
dimension vectors S_j^i, j=1,...,fxs and i=1,...,cbs, where fxs is \lceil (P_{max} + 1)/2 \rceil, is used to
quantise an input vector MG_j, j=1,...,vs_n, of dimension vs_n. Interpolation (in this case linear)
is used on the S^i vectors to yield S'^i vectors of dimension vs_n. The S^i to S'^i interpolation
process is given by:

    S'^i_j = S^i_{\lfloor j fxs / vs_n \rfloor} + (j fxs / vs_n - \lfloor j fxs / vs_n \rfloor) \times (S^i_{\lceil j fxs / vs_n \rceil} - S^i_{\lfloor j fxs / vs_n \rfloor})    (38)

for i=1,...,cbs and j=1,...,vs_n.

This process effectively defines S'^i spectral shapes at the \omega_j^n frequencies of the MG^n
vector. A distortion measure D(S'^i, MG^n) is then defined between the S'^i and MG^n
vectors, and the codebook vector S^I that yields the minimum distortion is selected and its
index I is transmitted. Of course in the receiver, Equation (38) is used to define \hat{MG}^n from
S^I.
If we assume that P_{max} = 120 then fxs = 60. However this value can be reduced to 50 without
significant degradation by low pass filtering the signal synthesised from Equation (1). This is
achieved by setting to zero all the harmonics MG_j^n in the region of 3.4 to 4.0 KHz, in which
case:

    vs_n = \lceil 3400 \times P_n / f_s \rceil   if this is \le 50
    vs_n = 50                                    otherwise    (39)

and vs_n \le fxs.
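A sketch of the VS/SVQ search: each fixed size codevector is linearly interpolated to the vs_n points of the input vector and the closest interpolated shape wins. The index arithmetic of Equation (38) is reconstructed here, so treat the grid mapping as illustrative:

    import numpy as np

    def interp_codevector(s, vs_n):
        # s: fixed size (fxs) codevector; returns vs_n interpolated samples
        fxs = len(s)
        t = np.arange(1, vs_n + 1) * fxs / vs_n - 1.0   # positions on the fxs grid
        lo = np.clip(np.floor(t).astype(int), 0, fxs - 1)
        hi = np.clip(lo + 1, 0, fxs - 1)
        return s[lo] + (t - lo) * (s[hi] - s[lo])       # cf. Equation (38)

    def vs_svq_search(mg, codebook):
        # mg: input magnitude vector of size vs_n; codebook: cbs x fxs array
        vs_n = len(mg)
        d = [np.sum((interp_codevector(s, vs_n) - mg) ** 2) for s in codebook]
        return int(np.argmin(d))                        # transmitted index I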
Amplitude vectors, obtained from adjacent residual frames, exhibit significant redundancy,
which can be removed by means of backward prediction. Prediction is performed on a
harmonic basis, i.e. the amplitude value of each harmonic MG_j^n is predicted from the
amplitude value of the same harmonic in previous frames, i.e. \hat{MG}_j^{n-1}. A fixed linear
predictor \tilde{MG}_j^n = b \times \hat{MG}_j^{n-1} may be incorporated in the VS/SVQ system, and the
resulting DPCM structure is shown in Figure 36 (differential VS/SVQ, (DVS/SVQ)). In particular, error
vectors are formed as the difference between the original spectral amplitudes MG_j^n and their
predicted ones \tilde{MG}_j^n, i.e.:

    E_j^n = MG_j^n - \tilde{MG}_j^n   for 1 \le j \le vs_n

where the predicted spectral amplitudes \tilde{MG}_j^n are given as:

    \tilde{MG}_j^n = b \times \hat{MG}_j^{n-1}   when V_{n-1} = 1
    \tilde{MG}_j^n = 0                           when V_{n-1} = 0
    for 1 \le j \le vs_{n-1}    (40)

and

    \tilde{MG}_j^n = (1 / vs_{n-1}) \sum_{i=1}^{vs_{n-1}} \tilde{MG}_i^n   for vs_{n-1} < j \le vs_n    (41)
Furthermore the quantised spectral amplitudes \hat{MG}_j^n are given as:

    \hat{MG}_j^n = \tilde{MG}_j^n + \hat{E}_j^n                       for 1 \le j \le vs_n
    \hat{MG}_j^n = (1 / vs_n) \sum_{i=1}^{vs_n} \hat{MG}_i^n   for vs_n < j < \lceil (P_n + 1)/2 \rceil    (42)

where \hat{E}_j^n denotes the quantised error vector.
The quantisation of the E_j^n, 1 \le j \le vs_n, error vector incorporates Mean Removal and Gain Shape
Quantisation techniques, using the hierarchical VQ structure of Figure 36.
A weighted Mean Square Error is used in the VS/SVQ stage of the system. The weighting
function is defined as the frequency response of the filter W(z) = 1 / A_n(z/\gamma), where A_n(z) is
the short-term linear prediction filter and \gamma is a constant, defined as \gamma = 0.93. Such a weighting
function, that is proportional to the short-term envelope spectrum, results in substantially
improved decoded speech quality. The weighting function W_j^n is normalised so that:

    \sum_{j=1}^{vs_n} W_j^n = 1    (43)

The pdf of the mean value of E^n is very broad and, as a result, the mean value differs widely
from one vector to another. This mean value can be regarded as statistically independent of
the variation of the shape of the error vector E^n and thus can be quantised separately
without paying a substantial penalty in compression efficiency. The mean value of an error
vector is calculated as follows:

    M = \sum_{j=1}^{vs_n} W_j^n \times E_j^n    (44)

M is Optimum Scalar Quantised to \hat{M} and is then removed from the original error vector to
form Erm_j^n = E_j^n - \hat{M}. The overall quantisation distortion is attributed to the quantisation

ofthe "Mean Removed" error vectors (Erm" ), which is perforrned by a Gain-Shape Vector
Quantiser.
.
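The mean-removal stage of Equations (43) and (44) amounts to a few lines; the sketch below (hypothetical names, with scalar quantisation of the mean omitted for brevity) illustrates it:

```python
import numpy as np

def weighted_mean_removal(err, w):
    """Normalise the weights to sum to one (Equation 43), compute the
    weighted mean of the error vector (Equation 44) and remove it."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mean = np.sum(w * err)
    return err - mean, mean
```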
The objective of the Gain-Shape VQ process is to determine the gain value $G$ and the shape vector $S$ so as to minimise the distortion measure:

$$D(Erm^n, G \times S) = \sum_{j=1}^{vsn} W_j^n \left[ Erm_j^n - G \times S_j \right]^2 \qquad (45)$$

A gain optimised VQ search method, similar to techniques used in CELP systems, is employed to find the optimum $G$ and $S$. The shape codebook (CBS) of vectors $S^i$ is searched first to yield an index $I$, which maximises the quantity:

$$Q(i) = \frac{\left( \sum_{j=1}^{vsn} W_j^n \, Erm_j^n \, S_j'^i \right)^2}{\sum_{j=1}^{vsn} W_j^n \, S_j'^{i\,2}} \quad \text{for } i=1,\dots,cbs \qquad (46)$$
where $cbs$ is the number of codevectors in the CBS. The optimum gain value is defined as:

$$G = \frac{\sum_{j=1}^{vsn} W_j^n \, Erm_j^n \, S_j'^I}{\sum_{j=1}^{vsn} W_j^n \, S_j'^{I\,2}} \qquad (47)$$

and is Optimum Scalar Quantised to $\bar{G}$.
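The following sketch shows the gain-optimised search of Equations (46) and (47), assuming (as in the real system via Equation (38)) that the shape codevectors have already been interpolated to $vsn$ points; names are hypothetical:

```python
import numpy as np

def gain_shape_search(erm, w, shape_codebook):
    """Gain-optimised shape search: pick the codevector maximising Q(i)
    of Equation (46), then compute the optimum gain from Equation (47)."""
    best_i, best_q, best_g = -1, -np.inf, 0.0
    for i, s in enumerate(shape_codebook):   # s already at vsn points
        num = np.sum(w * erm * s)
        den = np.sum(w * s * s)
        q = num * num / den                  # Equation (46)
        if q > best_q:
            best_i, best_q, best_g = i, q, num / den  # Equation (47)
    return best_i, best_g
```

Note that, as in CELP search, the optimum gain for the winning index falls out of the same two sums used to rank the codevectors, so no second pass is needed.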
During shape quantisation the principles of VS/SVQ are employed, in the sense that the $S'^i$, $vsn$-size vectors are produced using Linear Interpolation on $fxs$-size codevectors $S^i$. Both trained and randomly generated shape CBS codebooks were investigated. Although $Erm^n$ has noise-like characteristics, systems using randomly generated shape codebooks resulted in unsatisfactory muffled decoded speech and were inferior to systems employing trained shape codebooks.

A closed-loop joint predictor and VQ design process was employed to design the CBS codebook, the optimum scalar quantisers CBM and CBG of the mean $M$ and gain $G$ values respectively, and also to define the prediction coefficient $b$ of Figure 36. In particular, the following steps take place in the design process.
STEP A0 (k=0). Given a training sequence of $MG_j^n$, the predictor $b_0$ is calculated in an open loop fashion (i.e. $\widetilde{MG}_j^n = b \times MG_j^{n-1}$ for $1 \le j \le \lceil (P_n+1)/2 \rceil$ when $V_{n-1} = 1$, or $\widetilde{MG}_j^n = 0$ elsewhere). Furthermore, the CBM0 mean, CBG0 gain and CBS0 shape codebooks are designed independently, and again in an open loop fashion, using unquantised $E^n_0$. In particular:
a) Given a training sequence of error vectors $E^n_0$, the mean value of each $E^n_0$ is calculated and used in the training process of an Optimum Scalar Quantiser (CBM0).
b) Given a training sequence of error vectors $E^n_0$ and the CBM0 mean quantiser, the mean value of each error vector is calculated, quantised using the CBM0 quantiser and removed from the original error vectors $E^n_0$ to yield a sequence of "Mean Removed" training vectors $Erm^n_0$.
c) Given a training sequence of $Erm^n_0$ vectors, each "Mean Removed" training vector is normalised to unit power (i.e. is divided by the factor $G = \sqrt{\sum_j W_j^n (Erm_j^n)^2}$), linearly interpolated to $fxs$ points, and then used in the training process of a conventional Vector Quantiser of $fxs$ dimensions (CBS0).
d) Given a training sequence of $Erm^n_0$ vectors and the CBS0 shape codebook, each "Mean Removed" training vector is encoded using Equations (46) and (47), and the value $G$ of Equation (47) is used in the training process of an Optimum Scalar Quantiser (CBG0).
k is then set to 1 (k=1).

STEP A1 Given a training sequence of $MG_j^n$ and the mean, gain and shape codebooks of the previous k-1 iterations (i.e. CBM$_{k-1}$, CBG$_{k-1}$, CBS$_{k-1}$), the optimum prediction coefficient $b_k$ is calculated.

STEP A2 Given a training sequence of $MG_j^n$, an optimum prediction coefficient $b_k$ and CBM$_{k-1}$, CBG$_{k-1}$, CBS$_{k-1}$, a training sequence of error vectors $E^n_k$ is formed, which is then used for the design of new mean, gain and shape codebooks (i.e. CBM$_k$, CBG$_k$, CBS$_k$).

STEP A3 The performance of the kth iteration quantisation system (i.e. $b_k$, CBM$_k$, CBG$_k$, CBS$_k$) is evaluated and compared against the quantisation system of the previous iteration (i.e. $b_{k-1}$, CBM$_{k-1}$, CBG$_{k-1}$, CBS$_{k-1}$). If the quantisation distortion converges to a minimum, the quantisation design process stops. Otherwise, k=k+1 and steps A1, A2 and A3 are repeated.
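The text does not spell out how $b_k$ is computed in STEP A1; one plausible reading is a least-squares fit of the current amplitudes against the previous frame's quantised amplitudes over the training data, sketched below with hypothetical names:

```python
import numpy as np

def optimum_predictor(training_pairs):
    """Least-squares estimate of b over all voiced frame pairs:
    minimises sum (MG_j^n - b * MGq_j^{n-1})^2, giving
    b = sum(MG * MGq_prev) / sum(MGq_prev^2)."""
    num = den = 0.0
    for mg, mgq_prev in training_pairs:  # (current, previous quantised)
        n = min(len(mg), len(mgq_prev))
        num += np.dot(mg[:n], mgq_prev[:n])
        den += np.dot(mgq_prev[:n], mgq_prev[:n])
    return num / den if den else 0.0
```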
The performance of each quantiser (i.e. $b_k$, CBM$_k$, CBG$_k$, CBS$_k$) has been evaluated using subjective tests and a LogSegSNR distortion measure, which was found to reflect the subjective performance of the system.
The design of the Mean-Shape-Gain Quantiser used in STEP A2 is performed using the following two steps:
STEP B1 Given a training sequence of error vectors $E^n_k$, the mean value of each $E^n_k$ is calculated and used in the training process of an Optimum Scalar Quantiser (CBM$_k$).
STEP B2 Given a training sequence of error vectors $E^n_k$ and the CBM$_k$ mean quantiser, the mean value of each residual vector is calculated, quantised and removed from the original residual vectors $E^n_k$ to yield a sequence of "Mean Removed" training vectors $Erm^n_k$, which are then used as the training data in the design of an optimum Gain-Shape Quantiser (CBG$_k$ and CBS$_k$). This involves steps C1 - C4 below. (The quantisation design process is performed under the assumption of an independent

gain shape quantiser structure, i.e. an input error vector $Erm^n_k$ can be represented by any possible combination of codebook shape vectors and gain quantiser levels.)
STEP C1 (v=0). Given a training sequence of vectors $Erm^n_k$ and initial CBG$_{k,0}$ and CBS$_{k,0}$ gain and shape codebooks respectively, compute the overall average distortion $D_{k,0}$ as in Equation (45). Set v equal to 1 (v=1).
STEP C2 Given a training sequence of vectors $Erm^n_k$ and the CBG$_{k,v-1}$ gain codebook from the previous iteration, compute the new shape codebook CBS$_{k,v}$ which minimises the VQ distortion measure. Notice that the optimum CBS$_{k,v}$ shape codebook is obtained when the distortion measure of Equation (45) is a minimum, and this is achieved in $M1_{k,v}$ iterations.
STEP C3 Given a training sequence of vectors $Erm^n_k$ and the CBS$_{k,v}$ shape codebook, compute a new gain quantiser CBG$_{k,v}$ which minimises the distortion measure of Equation (45). This optimum CBG$_{k,v}$ gain quantiser is obtained when the distortion measure of Equation (45) is a minimum, and this is achieved in $M2_{k,v}$ iterations.
STEP C4 Given a training sequence of vectors $Erm^n_k$ and the shape and gain codebooks CBS$_{k,v}$ and CBG$_{k,v}$, compute the average overall distortion measure. If $(D_{k,v-1} - D_{k,v})/D_{k,v} < \delta$, stop. Otherwise, v=v+1 and go back to STEP C2.
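The C1-C4 alternation is a Lloyd-style descent. The schematic sketch below captures its control flow; the update callbacks stand in for the centroid computations of Equations (48) and (49) given next, and all names are hypothetical:

```python
def design_gain_shape(vectors, shapes, gains, distortion,
                      update_shapes, update_gains, eps=1e-4):
    """Alternating optimisation of STEPs C1-C4: re-estimate the shape
    codebook for fixed gains, then the gain quantiser for fixed shapes,
    until the relative drop in average distortion falls below eps."""
    d_prev = distortion(vectors, shapes, gains)
    while True:
        shapes = update_shapes(vectors, shapes, gains)  # Equation (48)
        gains = update_gains(vectors, shapes, gains)    # Equation (49)
        d = distortion(vectors, shapes, gains)
        if (d_prev - d) / d < eps:                      # STEP C4 test
            return shapes, gains
        d_prev = d
```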
The centroids s,k"Y~ ,cbs and u=l, .,fxs of the shape Codebook CBSk-V-"l, are updated
during the mth iteration perforrned in STEP C2 (m= I ,...,M 1~; v) as follows:
(NC" j " + C" j "NC" ,." )
S~ ~r = C N:Er r~)l (48)
(DC" j " + C" j "DC~" j "
~:E~rrr ' ~)1
where DCj~ =Wj~(Gkr-l x f";")
NC" j n = WJr G~ V-' f" j J~ (Erm j' --G~ ' S,'',;' (""~' n

CA 02259374 1998-12-29
PCT/GB97/Ol
W O 98/~1848
fu,~ S ul,
~ 1 if fN, jJl < 1
~J~ lo if fllJ~ll>
I y ljfXs-ul+ls~s
vsll vsll
c~
o if ~jfxs-ul+l~fxs
u+ l iS U~ s
u"~u, j, n) = ~ vs" and
u--1 if u> j~
vs"
j-l if u<j~S
j + 1 y U > j ~
vs"
Q, denotes the cluster of Ermn k error vectors which are quantised to the S~ r-m-~ codebook
shape vector, cbs le~esell~ the total n unber of shape ql1~nti~tion levels, Jn represents the
CBGk-V-I gain codebook index which encodes the Erm'l ~ error vector auld ISjsvs".
The gain centroids $G^{k,v,m}_i$, $i=1,\dots,cbg$, of the CBG$_{k,v,m}$ gain quantiser, which are computed during the mth iteration in STEP C3 ($m=1,\dots,M2_{k,v}$), are given as:

$$G^{k,v,m}_i = \frac{\displaystyle \sum_{Erm^n \in D_i} \sum_{j=1}^{vsn} Erm_j^n \, S_j'^{I_n} \, W_j^n}{\displaystyle \sum_{Erm^n \in D_i} \sum_{j=1}^{vsn} \left( S_j'^{I_n} \right)^2 W_j^n} \qquad (49)$$

where $D_i$ denotes the cluster of $Erm^n_k$ error vectors which are quantised to the $G^{k,v,m-1}_i$ gain quantiser level, $cbg$ represents the total number of gain quantisation levels, $I_n$ represents the CBS$_{k,v}$ shape codebook index which encodes the $Erm^n_k$ error vector, and $1 \le j \le vsn$.
The above design process is applied to obtain the optimum shape codebook CBS, the optimum gain and mean quantisers CBG and CBM, and the optimum prediction coefficient $b$, which was finally set to $b = 0.35$.
Process VII calculates the energy of the residual signal. The LPC analysis performed in Process II provides the prediction coefficients $a_i$, $1 \le i \le p$, and the reflection coefficients $k_i$, $1 \le i \le p$. On the other hand, the Voiced/Unvoiced classification performed in Process I provides the short-term autocorrelation coefficient for zero delay of the speech signal ($R_0$) for the frame under consideration. Hence, the energy $E^n$ of the residual signal is given as:
$$E^n = \frac{R_0}{M} \prod_{i=1}^{p} \left( 1 - k_i^2 \right) \qquad (50)$$
The above expression represents the minimum prediction error as it is obtained from the Linear Prediction process. However, because of quantisation distortion, the parameters of the LPC filter used in the coding-decoding process are slightly different from the ones that achieve minimum prediction error. Thus, Equation (50) gives a good approximation of the residual signal energy with low computational requirements. The accurate $E^n$ value can be given as:

$$E^n = \frac{1}{M} \sum_{i=0}^{M-1} r^2(i) \qquad (51)$$

where $r(i)$ denotes the residual signal.
The resulting $E^n$ is then Scalar Quantised using an adaptive $\mu$-law quantiser arrangement similar to the one depicted in Figure 34. In the case where more than one $E^n_s$ is used in the system, i.e. the energy $E^n$ is calculated for a number of subframes, then $E^n_s$ is given by the general equation:

$$E^n_s = \frac{1}{M_s} \sum_{i=0}^{M_s - 1} r^2(i + s M_s), \qquad 0 \le s \le S - 1 \qquad (52)$$

Notice that when $S = 1$, $M_s = M$, and for $S = 4$, $M_s = M/4$, where $S$ denotes the number of subframes.
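Under the reading of Equations (50) and (51) given above (the $1/M$ normalisation is part of that reading, and the names are hypothetical), the two energy estimates can be sketched as:

```python
import numpy as np

def residual_energy_approx(r0, k, M):
    """Approximate residual energy per Equation (50): the LPC minimum
    prediction error R0 * prod(1 - k_i^2), normalised by frame length M."""
    return r0 * np.prod(1.0 - np.asarray(k) ** 2) / M

def residual_energy_exact(residual):
    """Exact value per Equation (51): mean squared residual over the frame."""
    r = np.asarray(residual, dtype=float)
    return np.dot(r, r) / len(r)
```

The approximation avoids running the inverse LPC filter at the encoder; the exact form is only needed when the residual is available anyway.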

Representative Drawing
A single figure which represents the drawing illustrating the invention.
Administrative Status

2024-08-01:As part of the Next Generation Patents (NGP) transition, the Canadian Patents Database (CPD) now contains a more detailed Event History, which replicates the Event Log of our new back-office solution.

Please note that "Inactive:" events refers to events no longer in use in our new back-office solution.

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer, as well as the definitions for Patent, Event History, Maintenance Fee and Payment History, should be consulted.

Event History

Description Date
Inactive: IPC expired 2013-01-01
Inactive: IPC expired 2013-01-01
Inactive: IPC expired 2013-01-01
Inactive: IPC deactivated 2011-07-29
Inactive: IPC deactivated 2011-07-29
Inactive: IPC deactivated 2011-07-29
Inactive: IPC from MCD 2006-03-12
Inactive: First IPC derived 2006-03-12
Inactive: IPC from MCD 2006-03-12
Inactive: IPC from MCD 2006-03-12
Application Not Reinstated by Deadline 2004-07-07
Time Limit for Reversal Expired 2004-07-07
Deemed Abandoned - Failure to Respond to Maintenance Fee Notice 2003-07-07
Letter Sent 2002-08-09
Inactive: Entity size changed 2002-07-12
Request for Examination Requirements Determined Compliant 2002-06-28
Request for Examination Received 2002-06-28
All Requirements for Examination Determined Compliant 2002-06-28
Letter Sent 2001-09-11
Reinstatement Requirements Deemed Compliant for All Abandonment Reasons 2001-08-21
Deemed Abandoned - Failure to Respond to Maintenance Fee Notice 2001-07-09
Letter Sent 1999-05-27
Inactive: Single transfer 1999-04-27
Inactive: IPC assigned 1999-03-09
Classification Modified 1999-03-09
Inactive: IPC assigned 1999-03-09
Inactive: First IPC assigned 1999-03-09
Inactive: IPC assigned 1999-03-09
Inactive: Courtesy letter - Evidence 1999-03-02
Inactive: Notice - National entry - No RFE 1999-02-24
Application Received - PCT 1999-02-22
Application Published (Open to Public Inspection) 1998-01-15

Abandonment History

Abandonment Date Reason Reinstatement Date
2003-07-07
2001-07-09

Maintenance Fee

The last payment was received on 2002-06-28

Note: If the full payment has not been received on or before the date indicated, a further fee may be required, which may be one of the following:

  • the reinstatement fee;
  • the late payment fee; or
  • additional fee to reverse deemed expiry.

Patent fees are adjusted on the 1st of January every year. The amounts above are the current amounts if received by December 31 of the current year.
Please refer to the CIPO Patent Fees web page to see all current fee amounts.

Fee History

Fee Type Anniversary Year Due Date Paid Date
Basic national fee - small 1998-12-29
Registration of a document 1999-04-27
MF (application, 2nd anniv.) - small 02 1999-07-07 1999-06-30
MF (application, 3rd anniv.) - small 03 2000-07-07 2000-06-23
Reinstatement 2001-08-21
MF (application, 4th anniv.) - small 04 2001-07-09 2001-08-21
Request for examination - standard 2002-06-28
MF (application, 5th anniv.) - standard 05 2002-07-08 2002-06-28
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
THE VICTORIA UNIVERSITY OF MANCHESTER
Past Owners on Record
COSTAS XYDEAS
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents


List of published and non-published patent-specific documents on the CPD.



Document Description                                           Date (yyyy-mm-dd)   Pages   Size (KB)
Representative drawing                                         1999-03-29          1       9
Description                                                    1998-12-28          67      2,346
Drawings                                                       1998-12-28          27      579
Claims                                                         1998-12-28          12      325
Abstract                                                       1998-12-28          1       57
Reminder of maintenance fee due                                1999-03-08          1       111
Notice of National Entry                                       1999-02-23          1       193
Courtesy - Certificate of registration (related document(s))   1999-05-26          1       116
Courtesy - Abandonment Letter (Maintenance Fee)                2001-08-05          1       182
Notice of Reinstatement                                        2001-09-10          1       172
Reminder - Request for Examination                             2002-03-10          1       119
Acknowledgement of Request for Examination                     2002-08-08          1       193
Courtesy - Abandonment Letter (Maintenance Fee)                2003-08-03          1       176
PCT                                                            1998-12-28          8       337
Correspondence                                                 1999-03-01          1       29
Fees                                                           2000-06-22          1       47