Patent 2026640 Summary

(12) Patent: (11) CA 2026640
(54) English Title: SPEECH ANALYSIS-SYNTHESIS METHOD AND APPARATUS THEREFOR
(54) French Title: METHODE ET APPAREIL D'ANALYSE-SYNTHESE DE PAROLES
Status: Deemed expired
Bibliographic Data
(52) Canadian Patent Classification (CPC):
  • 354/47
(51) International Patent Classification (IPC):
  • G10L 19/08 (2006.01)
  • G10L 19/12 (2006.01)
(72) Inventors :
  • HONDA, MASAAKI (Japan)
(73) Owners :
  • NIPPON TELEGRAPH & TELEPHONE CORPORATION (Japan)
(71) Applicants :
  • HONDA, MASAAKI (Japan)
(74) Agent: KIRBY EADES GALE BAKER
(74) Associate agent:
(45) Issued: 1996-07-09
(22) Filed Date: 1990-10-01
(41) Open to Public Inspection: 1991-04-03
Examination requested: 1990-10-01
Availability of licence: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): No

(30) Application Priority Data:
Application No. Country/Territory Date
257503/89 Japan 1989-10-02

Abstracts

English Abstract





An impulse sequence of a pitch frequency is
detected from a phase-equalized prediction residual of an
input speech signal, and a quasi-periodic impulse
sequence is obtained by processing the impulse sequence
so that a fluctuation in its pitch frequency is within an
allowed limit range. The magnitudes of the quasi-
periodic impulse sequence are so determined as to
minimize an error between the waveform of a synthesized
speech obtainable by exciting an all-pole filter with the
quasi-periodic impulse sequence and the waveform of a
phase-equalized speech obtainable by applying the input
speech signal to a phase equalizing filter. Preferably,
the quasi-periodic impulse sequence is supplied to the
all-pole filter after being applied to a zero filter in
which it is given features of the prediction residual of
the speech. Coefficients of the zero filter are also
determined so that the error of the waveforms of the
synthesized speech and the phase-equalized speech is
minimum.


Claims

Note: Claims are shown in the official language in which they were submitted.


Claims:

1. A method for analysing a speech to generate
parameters representing an input speech waveform
including parameters of an excitation signal for exciting
a linear filter representing a speech spectral envelope
characteristic, comprising:
a step wherein a phase-equalized prediction
residual of the input speech waveform is produced;
a step wherein reference time points where
levels of said phase-equalized prediction residual exceed
a predetermined threshold are determined;
a step wherein impulse positions are determined
based on the reference time points such that when
fluctuation of successive intervals of the reference time
points in each analysis window is within a predetermined
range, the reference time points are determined as
impulse positions, and when fluctuation of successive
intervals of the reference time points exceeds the
predetermined range, impulse positions are determined by
moving or deleting the reference time points or inserting
reference time points to define a sequence of quasi-
periodic impulses at the determined impulse positions
having a pitch frequency of a limited fluctuation width,
the positions of said quasi-periodic impulse sequence
being one of the parameters representing said excitation
signal; and
a step wherein magnitudes of the respective
impulses of the quasi-periodic sequence in each analysis
window are so determined as to minimize an error between
the phase-equalized speech waveform and a synthesized
speech waveform obtained by exciting said linear filter
with said quasi-periodic impulse sequence, the magnitudes
of the quasi-periodic impulses being another one of the
parameters representing said excitation signal.

2. The method of claim 1, including a step wherein
before being applied to said linear filter, said quasi-
periodic impulses are subjected to processing by a zero
filter which characterizes a fine spectral structure of
said speech, and coefficients of said zero filter are so
determined as to minimize an error between said phase-
equalized speech waveform and a synthesized speech
waveform obtained by exciting said linear filter with the
output of said zero filter, said coefficients of said
zero filter being used as one of said parameters
representing said excitation signal.

3. The method of claim 1 or 2, wherein said
excitation signal is used for a voiced sound and a random
sequence selected from a plurality of predetermined
random patterns is used as an excitation signal for an
unvoiced sound, and including a step wherein one of said
predetermined random patterns representing said
excitation signal for said unvoiced sound is so selected
as to minimize an error between said phase-equalized
speech waveform and a synthesized speech waveform
obtainable by exciting said linear filter with said
random patterns and the selected one of the predetermined
random patterns is used to produce one of the parameters
representing the input speech waveform.
4. A speech analysing apparatus comprising:
linear predictive analysis means for performing
a linear predictive analysis of an input speech signal
for each analysis window of a fixed length to obtain
prediction coefficients;
inverse filter means controlled by said
prediction coefficients, for deriving a prediction
residual from said input speech signal;
speech phase equalizing filter means for
rendering the phase of said input speech signal into a
zero phase to obtain a phase-equalized speech signal;

prediction residual phase equalizing filter
means for rendering the phase of said prediction residual
into a zero phase to obtain a phase-equalized prediction
residual signal;
reference time point generating means for
detecting impulses of magnitudes larger than a
predetermined threshold value in said phase-equalized
prediction residual signal and for outputting the
positions of said impulses as reference time points;
impulse position generating means for
determining, based on said reference time positions,
positions of impulses having a pitch frequency of a
limited fluctuation width, the impulse positions being
one of the parameters representing the excitation signal;
impulse sequence generating means for
generating impulses at said impulse positions;
all-pole filter means controlled by said
prediction coefficients and excited by said generated
impulse sequence to generate a synthesized speech; and
impulse magnitude calculating means whereby
magnitude values of said impulses generated by said
impulse sequence generating means are so determined as to
minimize an error between a waveform of a synthesized
speech obtainable by exciting said all-pole filter means
with said impulse sequence and a waveform of said phase-
equalized speech supplied from said speech phase-
equalizing filter means, the impulse magnitudes being
another one of the parameters representing the excitation
signal.

5. The apparatus according to claim 4, further
comprising:
zero filter means supplied with said impulse
sequence, for providing said impulse sequence with
features of the waveform of said phase-equalized
prediction residual signal and supplying the output

thereof to said all-pole filter means as the excitation
signal; and
zero filter coefficient calculating means for
determining coefficients of said zero filter means so
that an error between a waveform of a synthesized speech
obtained by exciting said all-pole filter means with the
output of said zero filter means and a waveform of said
phase-equalized speech is minimized.

6. The apparatus of claim 5, wherein said impulse
sequence generating means includes:
impulse position generating means supplied with
said reference time points, whereby when fluctuation in
the intervals of said reference time points is within a
predetermined limit range, said reference time points are
determined as impulse positions and when said fluctuation
is in excess of said predetermined limit range, impulse
positions are determined by performing processings of
addition of a time point to said reference time points or
omission of one of said reference time points or shift of
one of said reference time points so that the fluctuation
in the intervals of the processed reference time points
is held within said limit range; and
impulse magnitude calculating means whereby the
magnitude values of impulses at said determined impulse
positions are determined so that an error between a
waveform of a synthesized speech obtainable by exciting
said all-pole filter means with said impulse sequence and
a waveform of said phase-equalized speech is minimized.

7. The apparatus of claim 4 or 5, wherein said
linear predictive analysis means includes means for
determining whether said input speech signal in an
analysis window of a fixed length is voiced or unvoiced
and for outputting a voiced/unvoiced decision signal, and
said apparatus further includes random pattern generating
means for generating a random pattern which minimizes an
error between a waveform of a synthesized speech
obtained by exciting said all-pole filter means with one
of a plurality of predetermined random patterns and a
waveform of said phase-equalized speech in a window
during which said decision signal is unvoiced.

8. The apparatus of claim 4 or 5, wherein said
impulse sequence generating means includes vector
quantizing means for vector quantizing the magnitude
values of said impulses determined by said impulse
magnitude calculating means, whereby said impulse
sequence has said quantized magnitude values.

9. A speech synthesizing apparatus for receiving
parameters representing a speech and synthesizing the
speech in accordance with the received parameters, said
parameters representing the
excitation signal including parameters that represent
impulse positions of a sequence of impulses in a phase-
equalized residual of the speech waveform and parameters
that represent zero-filter coefficients, said apparatus
comprising:
impulse sequence generating means for
generating a sequence of impulses at impulse positions
respectively designated by said parameters representing
said impulse positions;
zero filter means having an impulse response
which assumes values at each impulse position and at a
predetermined number of successive sample points on
either side of said each impulse position and further at
a midpoint between adjacent impulse positions and at a
predetermined number of successive sample points on
either side of said midpoint, said zero filter means
being supplied with said sequence of impulses from said
impulse sequence generating means and excited under
control of zero filter coefficients supplied thereto as
one of said parameters representing said excitation
signal for providing said sequence of impulses with a
shape resembling a phase-equalized residual of the
speech; and
all-pole filter means excited by the output of
said zero filter means under control of prediction
coefficients supplied thereto as another one of the
parameters representing said speech waveform, said
prediction coefficients representing a speech spectral
envelope characteristic.

Description

Note: Descriptions are shown in the official language in which they were submitted.





SPEECH ANALYSIS-SYNTHESIS METHOD
AND APPARATUS THEREFOR

BACKGROUND OF THE INVENTION
The present invention relates to a speech
analysis-synthesis method and apparatus in which a linear
filter representing the spectral envelope characteristic
of a speech is excited by an excitation signal to
synthesize a speech signal.
Heretofore, linear predictive vocoder and
multipulse predictive coding have been proposed for use
in speech analysis-synthesis systems of this kind. The
linear predictive vocoder is now widely used for speech
coding in a low bit rate region below 4.8 kb/s and this
system includes a PARCOR system and a line spectrum pair
(LSP) system. These systems are described in detail in
Saito and Nakata, "Fundamentals of Speech Signal
Processing," ACADEMIC PRESS, INC., 1985, for instance.
The linear predictive vocoder is made up of an all-pole
filter representing the spectral envelope characteristic
of a speech and an excitation signal generating part for
generating a signal for exciting the all-pole filter.
The excitation signal is a pitch frequency impulse
sequence for a voiced sound and a white noise for an
unvoiced sound. Excitation parameters are the
distinction between voiced and unvoiced sounds, the pitch
frequency and the magnitude of the excitation signal.
These parameters are extracted as average features of the
speech signal in an analysis window of about 30 msec. In
the linear predictive vocoder, since speech feature
parameters extracted for each analysis window as
mentioned above are interpolated temporally to
synthesize a speech, features of its waveform cannot be
reproduced with sufficient accuracy when the pitch
frequency, magnitude and spectrum characteristic of the
speech undergo rapid changes. Furthermore, since the
excitation signal composed of the pitch frequency impulse
sequence and the white noise is insufficient for
reproducing features of various speech waveforms, it is
difficult to produce a highly natural-sounding
synthesized speech. To improve the quality of the
synthesized speech in the linear predictive vocoder, it
is considered in the art to use excitation which permits
more accurate reproduction of features of the speech
waveform.
On the other hand, the multipulse predictive
coding is a method that uses excitation of higher
reproducibility than in the conventional vocoder. With
this method, the excitation signal is expressed using a
plurality of impulses and two all-pole filters
representing proximity correlation and pitch correlation
characteristics of speech are excited by the excitation
signal to synthesize the speech. The temporal positions
and magnitudes of the impulses are selected such that an
error between input original and synthesized speech
waveforms is minimized. This is described in detail in
B.S. Atal, "A New Model of LPC Excitation for Producing
Natural-Sounding Speech at Low Bit Rates," IEEE Int.
Conf. on ASSP, pp. 614-617, 1982. With the multipulse
predictive coding, the speech quality can be enhanced by
increasing the number of impulses used, but when the bit
rate is low, the number of impulses is limited, and
consequently, reproducibility of the speech waveform is
impaired and no sufficient speech quality can be
obtained. It is considered in the art that an amount of
information of about 8 kb/s is needed to produce high
speech quality.
In the multipulse predictive coding, excitation
is determined so that the input speech waveform itself is
reproduced. On the other hand, there has also been
proposed a method in which a phase-equalized speech
signal resulting from equalization of a phase component
of the speech waveform to a certain phase is subjected to
multipulse predictive coding, as set forth in United
States Patent No. 4,850,022 issued to the inventor of
this application. This method improves the speech
quality at low bit rates, because the number of impulses
for reproducing the excitation signal can be reduced by
removing from the speech waveform the phase component of
a speech which is dull in terms of human hearing. With
this method, however, when the bit rate drops to 4.8 kb/s
or so, the number of impulses becomes insufficient for
reproducing features of the speech waveform with high
accuracy and no high quality speech can be produced,
either.
SUMMARY OF THE INVENTION
It is therefore an object of the present
invention to provide a speech analysis-synthesis method
and apparatus which permit the production of high quality
speech at bit rates ranging from 2.4 to 4.8 kb/s, i.e. in
the boundary region between the amounts of information
needed for the linear predictive vocoder and for the
speech waveform coding.
According to the present invention, a zero
filter is excited by a quasi-periodic impulse sequence
derived from a phase-equalized prediction residual of an
input speech signal and the resulting output signal from
the zero filter is used as an excitation signal for a
voiced sound in the speech analysis-synthesis. The
coefficients of the zero filter are selected such that an
error between a speech waveform synthesized by exciting
an all-pole prediction filter by the excitation signal
and the phase-equalized input signal is minimized. The
zero filter, which is placed under the control of the
thus selected coefficients, can synthesize an excitation
signal accurately representing features of the prediction
residual of the phase-equalized speech, in response to
the above-mentioned quasi-periodic impulse sequence. By
using the position and magnitude of each impulse of an
input impulse sequence and the coefficients of the zero
filter as parameters representing the excitation signal,
a high quality speech can be synthesized with a smaller
amount of information.
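The following sketch (added here for illustration only, in Python; it is not part of the patent text) shows the overall excitation model just described: a quasi-periodic impulse sequence is shaped by an all-zero filter and then drives an all-pole filter 1/A(z). All positions, magnitudes and filter taps below are placeholder values.

```python
# Illustrative sketch (not the patented implementation): a voiced-frame
# excitation is formed from a quasi-periodic impulse train, shaped by a
# zero (all-zero, FIR) filter, and fed to an all-pole filter 1/A(z).
# Positions, magnitudes, zero-filter taps and LPC coefficients are
# placeholder values chosen only for demonstration.
import numpy as np
from scipy.signal import lfilter

N = 240                                   # samples in one analysis window
positions = [20, 80, 140, 200]            # quasi-periodic impulse positions t_i
magnitudes = [1.0, 0.9, 1.1, 0.95]        # impulse magnitudes m_i

excitation = np.zeros(N)
excitation[positions] = magnitudes        # quasi-periodic impulse sequence

v = np.array([0.2, 1.0, 0.2])             # zero-filter impulse response (placeholder)
shaped = lfilter(v, [1.0], excitation)    # zero filter adds residual-like shape

a = np.array([1.0, -1.3, 0.6])            # A(z) = 1 - 1.3 z^-1 + 0.6 z^-2 (placeholder)
synthesized = lfilter([1.0], a, shaped)   # all-pole filter 1/A(z) -> synthesized speech
```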
Based on the pitch frequency impulse sequence
obtained from the phase-equalized prediction residual, a
quasi-periodic impulse sequence having limited
fluctuation in its pitch period is produced. By using
the quasi-periodic impulse sequence as the above-
mentioned impulse sequence, it is possible to further
reduce the amount of parameter information representing
the impulse sequence.
In the conventional vocoder the pitch period
impulse sequence composed of the pitch period and
magnitudes obtained for each analysis window is used as
the excitation signal, whereas in the present invention
the impulse position and magnitude are determined for
each pitch period and, if necessary, the zero filter is
introduced, with a view to enhancing the reproducibility
of the speech waveform. In the conventional multipulse
predictive coding a plurality of impulses are used to
represent the excitation signal of one pitch period,


2026640

whereas in the present invention the excitation signal is
represented by one impulse per pitch period and the
coefficients of the zero filter set for each fixed frame
so as to reduce the amount of information for the
excitation signal. Besides, the prior art employs, as a
criterion for determining the excitation parameters, an
error between the input speech waveform and the
synthesized speech waveform, whereas the present
invention uses an error between the synthesized speech
waveform and the phase-equalized speech waveform. By using a
waveform matching criterion for the phase-equalized
speech waveform, it is possible to improve matching
between the input speech waveform and the speech waveform
synthesized from the excitation signal used in the
present invention. Since the phase-equalized speech
waveform and the synthesized one are similar to each
other, the number of excitation parameters can be reduced
by determining them while comparing the two speech
waveforms.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a block diagram illustrating an
embodiment of the speech analysis-synthesis method
according to the present invention;
Fig. 2 is a block diagram showing an example of
a phase equalizing and analyzing part 4;
Fig. 3 is a diagram for explaining a quasi-
periodic impulse excitation signal;
Fig. 4 is a flowchart of an impulse position
generating process;
Fig. 5A is a diagram for explaining the
insertion of an impulse position in Fig. 4;
Fig. 5B is a diagram for explaining the removal
of an impulse position in Fig. 4;
Fig. 5C is a diagram for explaining the shift of
an impulse position in Fig. 4;
Fig. 6 is a block diagram illustrating an
example of an impulse magnitude calculation part 8;
Fig. 6A is a block diagram illustrating a
frequency weighting filter processing part 39 shown in
Fig. 6;
Fig. 7A is a diagram showing an example of the
waveform of a phase-equalized prediction residual;
Fig. 7B is a diagram showing an impulse response
of a zero filter;
Fig. 8 is a block diagram illustrating an
example of a zero filter coefficient calculation part 11;
Fig. 9 is a block diagram illustrating another
example of the impulse magnitude calculation part 8; and
Fig. 10 is a diagram showing the results of
comparison of synthesized speech quality between the
present invention and the prior art.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Fig. 1 illustrates in block form the
constitution of the speech analysis-synthesis system of
the present invention. A sampled digital speech signal
s(t) is input via an input terminal 1. In a linear
predictive analyzing part 2, N samples of the speech signal
are first stored in a data buffer for each analysis window
and then these samples are subjected to a linear
predictive analysis by a known linear predictive coding
method to calculate a set of prediction coefficients ai
(where i = 1, 2, ..., p). In the linear predictive
analyzing part 2 a prediction residual signal e(t) of the
input speech signal s(t) is obtained by an inverse filter
(not shown) which uses the set of prediction coefficients
as its filter coefficients. Based on the decision of the
level for a maximum value of an auto-correlation function
of the prediction residual signal, it is determined
whether the speech is voiced (V) or unvoiced (U) and a
decision signal W is output accordingly. This
processing is described in detail in the afore-mentioned
literature by Saito, et al. The prediction coefficients
ai obtained in the linear predictive analyzing part 2 are
provided to a phase equalizing-analyzing part 4 and, at
the same time, they are quantized by a quantizer 3.
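A minimal sketch of this analysis step, assuming the standard autocorrelation (Levinson-Durbin) formulation of linear predictive coding; the helper names, the window handling and the crude voiced/unvoiced test are illustrative assumptions rather than the patent's exact procedure.

```python
# Sketch of the linear predictive analysis (part 2): autocorrelation
# method with Levinson-Durbin recursion, followed by inverse filtering
# with A(z) = 1 + a1 z^-1 + ... + ap z^-p to obtain the residual e(t).
import numpy as np
from scipy.signal import lfilter

def lpc(frame, p):
    """Levinson-Durbin recursion: returns a = [1, a1, ..., ap] of A(z)."""
    r = [float(np.dot(frame[: len(frame) - k], frame[k:])) for k in range(p + 1)]
    a, err = [1.0] + [0.0] * p, r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a, err = new_a, err * (1.0 - k * k)
    return np.asarray(a)

def prediction_residual(frame, a):
    """Inverse filter A(z): e(t) = s(t) + a1 s(t-1) + ... + ap s(t-p)."""
    return lfilter(a, [1.0], frame)

def is_voiced(e, lag_min=20, lag_max=160, threshold=0.3):
    """Crude V/U decision from the residual autocorrelation peak (illustrative)."""
    ac = np.correlate(e, e, mode="full")[len(e) - 1:]
    return ac[lag_min:lag_max].max() / (ac[0] + 1e-12) > threshold
```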
In the phase equalizing-analyzing part 4
coefficients of a phase equalizing filter for rendering
the phase characteristic of the speech into a zero phase
and reference time points of phase equalization are
computed. Fig. 2 shows in detail the constitution of the
phase equalizing-analyzing part 4. The speech signal
s(t) is applied to an inverse filter 31 to obtain the
prediction residual e(t). The prediction residual e(t)
is provided to a maximum magnitude position detecting
part 32 and a phase equalizing filter 37. A switch
control part 33C monitors the decision signal W fed from
the linear predictive analyzing part 2 and normally
connects a switch 33 to the output side of a magnitude
comparing part 38, but when the current window is of a
voiced sound V and the immediately preceding frame is of
an unvoiced sound U, the switch 33 is connected to the
output side of the maximum magnitude position detecting
part 32. In this instance, the maximum magnitude
position detecting part 32 detects and outputs a sample
time point t'p at which the magnitude of the prediction
residual e(t) is maximum.
Let it be assumed that smoothed phase-equalizing
filter coefficients ht'i(k) have been obtained for the
currently determined reference time point t'i at a
coefficient smoothing part 35. The coefficients ht'i(k)
are supplied from the filter coefficient holding part 36
to the phase equalizing filter 37. The prediction
residual e(t), which is the output of the inverse filter
31, is phase-equalized by the phase equalizing filter 37
and output therefrom as phase-equalized prediction
residual ep(t). It is well known that when the input
speech signal s(t) is a voiced sound signal, the
prediction residual e(t) of the speech signal has a
waveform having impulses at the pitch intervals of the
voiced sound. The phase equalizing filter 37 produces an
effect of emphasizing the magnitudes of impulses of such
pitch intervals.
The magnitude comparing part 38 compares levels
of the phase-equalized prediction residual ep(t) with a
predetermined threshold value, determines, as an impulse
position, each sample time point where the sample value
exceeds the threshold value, and outputs the impulse
position as the next reference time point t'i+1 on the
condition that an allowable minimum value of the impulse
intervals is Lmin and the next reference time point t'i+1
is searched for from sample points spaced more than the
value Lmin apart from the time point t'i.
When the frame is an unvoiced sound frame, the
phase-equalized residual ep(t) during the unvoiced sound
frame is composed of substantially random components (or
white noise) which are considerably lower than the
threshold value mentioned above, and the magnitude
comparing part 38 does not produce, as an output of the
phase equalizing-analyzing part 4, the next reference
time point t'i+1. Rather, the magnitude comparing part
38 determines a dummy reference time point t'i+1 at, for
example, the last sample point of the frame (but not
limited thereto) so as to be used for determination of
smoothed filter coefficients at the smoothing part 35 as
will be explained later.
In response to the next reference time point
t'i+1 thus obtained in the voiced sound frame, a filter
coefficient calculating part 34 calculates (2M+1) filter
coefficients h*(k) of the phase equalizing filter 37 in
accordance with the following equation:

    h*(k) = e(t'i+1 - k) / [ Σ_{n=-M}^{M} e²(t'i+1 + n) ]^(1/2)       ... (1)

where k = -M, -(M-1), ..., 0, 1, ..., M.
On the other hand, when the frame is of an unvoiced sound
frame, the filter coefficient calculating part 34
calculates the filter coefficients h*(k) of the phase
equalizing filter 37 by the following equation:
    h*(k) = 1  for k = 0,
          = 0  for k ≠ 0                                              ... (2)

where k = -M, ..., M.
The characteristic of the phase-equalizing filter 37
expressed by Eq. (2) represents such a characteristic
that the input signal thereto is passed therethrough
intact.
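The coefficient calculation of Eqs. (1) and (2) can be sketched as follows; because the printed form of Eq. (1) is partly illegible, the unit-energy normalization used in the denominator here is an assumption made for illustration.

```python
# Sketch of the phase equalizing filter coefficients of Eqs. (1)/(2).
# h*(k) is the time-reversed residual segment around the reference time
# point t'_{i+1}; the unit-energy normalization is an assumption.
import numpy as np

def phase_eq_coefficients(e, t_ref, M, voiced=True):
    """Return h*(k), k = -M..M (requires M <= t_ref < len(e) - M)."""
    if not voiced:                         # Eq. (2): pass the residual intact
        h = np.zeros(2 * M + 1)
        h[M] = 1.0
        return h
    seg = e[t_ref - M : t_ref + M + 1]     # e(t_ref + n), n = -M..M
    h = seg[::-1]                          # e(t_ref - k), k = -M..M
    return h / (np.sqrt(np.sum(seg ** 2)) + 1e-12)
```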
The filter coefficients h*(k) thus calculated
for the next reference time point t'i+1 are smoothed by
the coefficient smoothing part 35 as will be described
later to obtain smoothed phase equalizing filter
coefficients ht'i+1(k), which are held by the coefficient
holding part 36 and supplied as updated coefficients
ht'i(k) to the phase equalizing filter 37. The phase
equalizing filter 37 having its coefficients thus updated
phase-equalizes the prediction residual e(t) again, and
based on its output, the next impulse position, i.e., a
new next reference time point t'i+1 is determined by the
magnitude comparing part 38. In this way, a next
reference time point t'i+1 is determined based on the
phase-equalized residual ep(t) output from the phase
equalizing filter 37 whose coefficients have been set to
ht'i(k) and, thereafter, new smoothed filter coefficients
ht'i+1(k) are calculated for the reference time point
t'i+1. By repeating these processes using the reference
time point t'i+1 and the smoothed filter coefficients
ht'i+1(k) as new t'i and ht'i(k), reference time points in
each frame and the smoothed filter coefficients ht'i(k)
for these reference time points are determined in a
sequential order.
In the case where a speech is initiated after a
silent period or where a voiced sound is initiated after
continued unvoiced sounds, the prediction residual e(t)
including impulses of the pitch frequency is provided,
for the first time, to the phase equalizing filter 37
having set therein the filter coefficients given
essentially by Eq. (2). In this instance, the magnitudes
of impulses are not emphasized and, consequently, the
prediction residual e(t) is output intact from the filter
37. Hence, when the magnitudes of impulses of the pitch
frequency happen to be smaller than the threshold value,
the impulses cannot be detected in the magnitude
comparing part 38. That is, the speech is processed as
if no impulses are contained in the prediction residual,
and consequently, the filter coefficients h*(k) for the
impulse positions are not obtained -- this is not
preferable from the viewpoint of the speech quality in
the speech analysis-synthesis.
To solve this problem, in the Fig. 2 embodiment,
when the input speech signal analysis window changes from
an unvoiced sound frame to a voiced sound frame as
mentioned above, the maximum magnitude detecting part 32
detects the maximum magnitude position t'p of the
prediction residual e(t) in the voiced sound frame and
provides it via the switch 33 to the filter coefficient
calculating part 34 and, at the same time, outputs it as
a reference time point. The filter coefficient
calculating part 34 calculates the filter coefficients
h*(k), using the reference time point t'p in place of
t'i+1 in Eq. (1).
Next, a description will be given of the
smoothing process of the phase equalizing filter
coefficients h*(k) by the coefficient smoothing part 35.
The filter coefficients h*(k) determined for the next
reference time point t'i+l and supplied to the smoothing
part 35 are smoothed temporarily by a filtering process
of first order expressed by, for example, the following
recurrence formula:
    ht(k) = b·ht-1(k) + (1 - b)·h*(k)                                 ... (3)

where t'i < t ≤ t'i+1.
The coefficient b is set to a value of about
0.97. In Eq. (3), ht-1(k) represents smoothed filter
coefficients at an arbitrary sample point (t-1) in the
time interval between the current reference time point
t'i and the next reference time point t'i+l, and ht(k)
represents the smoothed filter coefficients at the next
sample point. This smoothing takes place for every
sample point from a sample point next to the current
reference time point t'i, for which the smoothed filter
coefficients have already been obtained, to the next
reference time point t'i+1 for which the smoothed filter
coefficients are to be obtained next. The filter
coefficient holding part 36 holds those of the thus
sequentially smoothed filter coefficients ht(k) which
were obtained for the last sample point which is the next
reference time point, that is, ht'i+1(k), and supplies
them as updated filter coefficients ht'i(k) to the phase
equalizing filter 37 for further determination of a
subsequent next reference time point.
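A sketch of the first-order smoothing of Eq. (3); the value b = 0.97 follows the text, while the function name and calling convention are illustrative.

```python
# Sketch of Eq. (3): the filter coefficients are relaxed from their
# previous smoothed values toward the newly computed h*(k) at every
# sample point between two successive reference time points.
import numpy as np

def smooth_coefficients(h_prev, h_star, n_samples, b=0.97):
    """Apply h_t(k) = b*h_{t-1}(k) + (1-b)*h*(k) over n_samples steps."""
    h = np.array(h_prev, dtype=float)
    for _ in range(n_samples):             # from t'_i + 1 up to t'_{i+1}
        h = b * h + (1.0 - b) * np.asarray(h_star)
    return h                               # smoothed coefficients h_{t'_{i+1}}(k)
```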
The phase equalizing filter 37 is supplied with
the prediction residual e(t) and calculates the phase-
equalized prediction residual ep(t) by the following
equation:

    ep(t) = Σ_{k=-M}^{M} ht'i(k)·e(t - k)                             ... (4)

The calculation of Eq. (4) needs only to be performed
until the next impulse position is detected by the
magnitude comparing part 38 after the reference time
point t'i at which the above-said smoothed filter
coefficients were obtained. In the magnitude comparing
part 38 the magnitude level of the phase-equalized
prediction residual ep(t) is compared with a threshold
value, and the sample point where the former exceeds the
latter is detected as the next reference time point t'i+1
in the current frame. Incidentally, in the case where no
magnitude exceeds the threshold value within a
predetermined period after the latest impulse position
(reference time point) t'i, processing is performed by
which the time point where the phase-equalized prediction
residual ep(t) takes the maximum magnitude until then is
detected as the next reference time point t'i+1.
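The filtering of Eq. (4) and the threshold search in the magnitude comparing part 38 can be sketched as follows; the threshold, Lmin and the fallback to the maximum follow the text, but the function signature is an assumption.

```python
# Sketch of Eq. (4) and the reference-time-point search: the residual is
# phase-equalized with the current smoothed coefficients h(k) and scanned
# for the first sample exceeding the threshold, at least Lmin samples
# after the previous reference point; otherwise the maximum is taken.
import numpy as np

def next_reference_point(e, h, t_prev, threshold, Lmin, t_end):
    """Return the next reference time point t'_{i+1} (sample index)."""
    M = (len(h) - 1) // 2
    best_t, best_val = None, -np.inf
    for t in range(max(t_prev + Lmin, M), min(t_end, len(e) - M)):
        seg = e[t - M : t + M + 1]           # e(t-M) ... e(t+M)
        ep = float(np.dot(h[::-1], seg))     # Eq. (4): sum_k h(k) e(t-k)
        if ep > threshold:
            return t                         # first threshold crossing
        if ep > best_val:
            best_t, best_val = t, ep
    return best_t                            # fall back to the maximum
```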
The procedure for obtaining the reference time
point t'i and the smoothed filter coefficients ht'i(k) at
that point as described above may be briefly summarized
in the following outline.
Step 1: At first, the phase-equalized prediction
residual ep(t) is calculated by Eq. (4) using
the filter coefficients ht'i(k) set in the phase
equalizing filter 37 until then, that is, the
smoothed filter coefficients obtained for the
last impulse position in the preceding frame,
and the prediction residual e(t) of the given
frame. This calculation needs only to be
performed until the detection of the next
impulse after the preceding impulse position.
Step 2: The magnitude of the phase-equalized prediction
residual is compared with a threshold value in
the magnitude comparing part 38, the sample
point at which the residual exceeds the
threshold value is detected as an impulse
position, and the first impulse position ti+1 (i
= 0, that is, t1) in the current frame is
obtained as the next reference time point.
Step 3: The coefficients h*(k) of the phase equalizing
filter at the reference time point t1 are
calculated by substituting the time point t1 for
t'i+1 in Eq. (1).
Step 4: The filter coefficients h*(k) for the first
reference time t1 are substituted into Eq. (3),
and the smoothed filter coefficients ht(k) at
each of sample points after the preceding
impulse position (the last impulse position t0
in the preceding frame) are calculated by Eq.
(3) until the time point of the impulse position
t1. The smoothed filter coefficients at the
reference time point t1 obtained as a result are
represented by ht1(k).
Step 5: The phase-equalized prediction residual ep(t) is
calculated by substituting the smoothed filter
coefficients ht1(k) for the reference time point
t1 into Eq. (4). This calculation is performed
for a period from the reference time point t1 to
the detection of the next impulse position
(reference time point) t2.
Step 6: The second impulse position t2 of the phase-
equalized prediction residual thus calculated is
determined in the magnitude comparing part 38.
Step 7: The second impulse position t2 is substituted
for the reference time point t'i+1 in Eq. (1)
and the phase equalizing filter coefficients
h*(k) for the impulse position t2 are
calculated.
Step 8: The filter coefficients for the second impulse
position t2 are substituted into Eq. (3) and the
smoothed filter coefficients at respective
sample points are sequentially calculated
starting at a sample point next to the first
impulse position tl and ending at the second
impulse position t2. As a result of this, the
smoothed filter coefficients ht2(k) at the
second impulse position t2 are obtained.
Thereafter, steps 5 through 8, for example, are
repeatedly performed in the same manner as mentioned
above, by which the smoothed filter coefficients ht'i(k)
at all impulse positions in the frame can be obtained.
As shown in Fig. 1, the smoothed filter
coefficients ht(k) obtained in the phase equalizing-
analyzing part 4 are used to control the phase equalizing
filter 5. By inputting the speech signal s(t) into the
phase equalizing filter 5, the processing expressed by
the following equation is performed to obtain a phase-
equalized speech signal Sp(t).

    Sp(t) = Σ_{k=-M}^{M} ht(k)·s(t - k)                               ... (5)

Next, an excitation parameter analyzing part 30
will be described. In the analysis-synthesis method of
the present invention different excitation sources are
used for voiced and unvoiced sounds and a switch 17 is
changed over by the voiced or unvoiced sound decision
signal W. The voiced sound excitation source comprises
an impulse sequence generating part 7 and an all-zero
filter (hereinafter referred to simply as zero filter)
10.
The impulse sequence generating part 7 generates
such a quasi-periodic impulse sequence as shown in Fig. 3
in which the impulse position ti and the magnitude mi of
each impulse are specified. The temporal position (the
impulse position) ti and the magnitude mi of each impulse
in the quasi-periodic impulse sequence are represented as
parameters. The impulse position ti is produced by an
impulse position generating part 6 based on the reference
time point t'i, and the impulse magnitude mi is
controlled by an impulse magnitude calculating part 8.
In the impulse position generating part 6 the
interval between the reference time points (representing
the positions of impulses of the pitch frequency in the
phase-equalized prediction residual) determined in the
phase equalizing-analyzing part 4 is controlled to be
quasi-periodic so as to reduce fluctuations in the
impulse position and hence reduce the amount of
information necessary for representing the impulse
position. That is, the interval, Ti = ti - ti-1, between
impulses to be generated, shown in Fig. 3, is limited so
that a difference in the interval between successive
impulses is equal to or smaller than a fixed allowable
value J as expressed by the following equation:

    ΔTi = |Ti - Ti-1| ≤ J                                             ... (6)
Next, a description will be given, with
reference to Fig. 4, of an example of the impulse
position generating procedure which the impulse position
generating part 6 implements.
Step S1: When all the reference time points t'i (where i
= 1, 2, ...) in the current frame are input from
the phase equalizing-analyzing part 4, the
process proceeds to the next step S2 if the
preceding frame is a voiced sound frame (the
current frame being also a voiced sound frame).
Step S2: A calculation is made of a difference, ΔT1 = Ti
- Ti-1, between two successive intervals Ti =
t'i - ti-1 and Ti-1 = ti-1 - ti-2 of the first
reference time point t'i (where i = 1) and the
two impulse positions ti-1 and ti-2 already
determined by the processing in Fig. 4 (already
determined for the last two reference time
points t'i-2 and t'i-1 in the preceding frame).
Step S3: The absolute value of the difference ΔT1 is
compared with the predetermined value J. When
the former is equal to or smaller than the
latter, it is determined that the input
reference time point t'i is within a
predetermined variation range, and the process
proceeds to step S4. When the former is greater


~2~

than the latter, it is determined that the
reference time point t'i varies in excess of the
predetermined limit, and the process proceeds to
step S6.
Step S4: Since the reference time point t'i is within the
predetermined variation range, this reference
time point is determined as the impulse position
ti.
Step S5: It is determined whether or not processing has
been completed for all the reference time points
t'i in the frame, and if not, the process goes
back to step S2, starting processing for the
next reference time point ti+1. If the
processing for all the reference time points has
been completed, then the process proceeds to
step S17.
Step S6: A calculation is made of a difference, ΔT2 =
(t'i - ti-1)/2 - (ti-1 - ti-2), between half of
the interval Ti between the impulse position
ti-1 and the reference time point t'i and the
already determined interval Ti-1.
Step S7: The absolute value of the above-mentioned
difference ΔT2 is compared with the value J, and
if the former is equal to or smaller than the
latter, the interval Ti is about twice as large
as the decided interval Ti-1 as shown in Fig.
5A; in this case, the process proceeds to step
S8.
Step S8: An impulse position ti is set at about the
middle between the reference time point t'i and
the preceding impulse position ti-1, and the
reference time point t'i is set at the impulse
position ti+1 and then the process proceeds to
step S5.
Step S9: When the condition in step S7 is not satisfied,
a calculation is made of a difference, ΔT3,
between the interval from the next reference
time point t'i+1 to the impulse position ti-1 and
the decided interval from the impulse position
ti-1 to ti-2.
Step S10: The absolute value of the above-mentioned
difference ΔT3 is compared with the value J.
When the former is equal to or smaller than the
latter, the reference time point t'i+1 is within
an expected range of the impulse position ti
next to the decided impulse position ti-1 and
the reference time point t'i is outside the
range and in between t'i+1 and ti-1. The process
proceeds to step S11.
Step S11: The excess reference time point t'i shown in
Fig. 5B is discarded, but instead the reference
time point t'i+1 is set at the impulse position
ti and the process proceeds to step S5.
Step S12: Where the condition in step S10 is not
satisfied, a calculation is made of a difference
ΔT4 between half of the interval between the
reference time point t'i+1 and the impulse
position ti-1 and the above-mentioned decided
interval Ti-1.
Step S13: The absolute value of the difference ΔT4 is
compared with the value J. When the former is
equal to or smaller than the latter, it means
that the reference time point t'i+1 is within an
expected range of the impulse position ti+1 next
to that ti as shown in Fig. 5C and that the
reference time point t'i is either one of two
reference time points t'i shown in Fig. 5C and
is outside an expected range of the impulse
position ti. In this instance, the process
proceeds to step S14.
Step S14: The reference time point t'i+1 is set as the
impulse position ti+1, and at the same time, the
reference time point t'i is shifted to the
middle between t'i+1 and ti-1 and set as the
impulse position ti, that is, ti = (t'i+1 +
ti-1)/2. The process proceeds to step S5.
Step S15: Where the condition in step S14 is not
satisfied, the reference time point t'i is set
as the impulse position ti without taking any
step for its inappropriateness as a pitch
position. The process proceeds to step S5.
Step S16: Where the preceding frame is an unvoiced sound
frame in step S1, all the reference time points
t'i in the current frame are set to the impulse
positions ti.
Step S17: The number of impulse positions is compared with
a predetermined maximum permissible number of
impulses Np, and if the former is equal to or
smaller than the latter, then the entire
processing is terminated. The number Np is a
fixed integer ranging from 5 to 6, for example,
and this is the number of impulses present in a
15 msec frame in the case where the upper limit
of the pitch frequency of a speech is regarded
as ranging from about 350 to 400 Hz at the
highest.
Step S18: Where the condition in step S17 is not
satisfied, the number of impulse positions is
greater than the number Np; so that magnitudes

-20- ~Q ~



of impulses are calculated for the respective
impulse positions by the impulse magnitude
calculating part 8 in Fig. 1 as described later.
Step S19: An impulse position selecting part 6A in Fig. 1
chooses Np impulse positions in the order of
magnitude and indicates the chosen impulses to
the impulse position generating part 6, with
which the process is terminated.
According to the processing described above in
respect of Fig. 4, even if the impulse position of the
phase-equalized prediction residual which is detected as
the reference time point t'i undergoes a substantial
change, a fluctuation of the impulse position ti which is
generated by the impulse position generating part 6 is
limited within a certain range. Thus, the amount of
information necessary for representing the impulse
position can be reduced. Moreover, even in the case
where the impulse magnitude at the pitch position in the
phase-equalized prediction residual happens to be smaller
than a threshold value and cannot be detected by the
magnitude comparing part 38 in Fig. 2, an impulse signal
is inserted by steps S7 and S8 in Fig. 4; so that the
quality of the synthesized speech is not essentially
impaired in spite of a failure in impulse detection.
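A greatly simplified sketch of the Fig. 4 procedure; only the acceptance test (steps S3-S4) and the midpoint insertion (steps S6-S8) are shown, while the deletion and shifting cases (steps S9-S14) and the Np limit (steps S17-S19) are omitted.

```python
# Simplified sketch: each reference time point is accepted as an impulse
# position when the interval fluctuation is within J; when the interval
# is roughly doubled an extra impulse is inserted at the midpoint.
def quasi_periodic_positions(ref_points, prev_positions, J):
    positions = list(prev_positions)        # needs at least t_{i-2}, t_{i-1}
    for t_ref in ref_points:
        T_prev = positions[-1] - positions[-2]
        T_cur = t_ref - positions[-1]
        if abs(T_cur - T_prev) <= J:                    # steps S3/S4: accept
            positions.append(t_ref)
        elif abs(T_cur / 2.0 - T_prev) <= J:            # steps S6-S8: insert
            positions.append(positions[-1] + T_cur // 2)
            positions.append(t_ref)
        else:                                           # step S15: accept as is
            positions.append(t_ref)
    return positions[len(prev_positions):]
```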
In the impulse magnitude calculating part 8 the
impulse magnitude at each impulse position ti generated
by the impulse position generating part 6 is selected so
that a frequency-weighted mean square error between a
synthesized speech waveform Sp'(t) produced by exciting
such an all-pole filter 18 with the impulse sequence
created by the impulse sequence generating part 7 and an
input speech waveform Sp(t) phase-equalized by a phase
equalizing filter 5 may be eventually minimized. Fig. 6
shows the internal construction of the impulse magnitude
calculating part 8. The phase-equalized input speech
waveform Sp(t) is supplied to a frequency weighting
filter processing part 39. The frequency weighting
filter processing part 39 acts to expand the bandwidth
of the resonance frequency components of a speech
spectrum and its transfer characteristic is expressed as
follows:

    Hw(z) = A(z)/A(z/γ)                                               ... (7)

where:

    A(z) = 1 + a1·z^-1 + ... + ap·z^-P                                ... (8)

where P is the analysis order, a1, a2, ..., ap are the
linear prediction coefficients and z^-1 is a unit sample
delay, γ is a parameter which controls the degree of
suppression and is in the range of 0 < γ ≤ 1, and the
degree of suppression increases as the value of γ
decreases. Usually, γ is in the range of 0.7 to 0.9.
The frequency weighting filter processing part
39 has such a construction as shown in Fig. 6A. The
linear prediction coefficients ai are provided to a
frequency weighting filter coefficient calculating part
39A, in which coefficients γ^i·ai of a filter having a
transfer characteristic A(z/γ) are calculated. A
frequency weighting filter 39B calculates coefficients of
a filter having a transfer characteristic Hw(z) =
A(z)/A(z/γ), from the linear prediction coefficients ai
and the frequency-weighted coefficients γ^i·ai and at the
same time, the phase-equalized speech Sp(t) is passed
through the filter of that transfer characteristic to
obtain a signal S'w(t).
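The frequency weighting of Eqs. (7) and (8) amounts to filtering with A(z) and then with 1/A(z/γ); a short sketch, assuming scipy's lfilter and γ = 0.8 as a representative value.

```python
# Sketch of Hw(z) = A(z)/A(z/gamma): the bandwidth expansion simply
# scales the i-th prediction coefficient by gamma**i.
import numpy as np
from scipy.signal import lfilter

def weighted_signal(x, a, gamma=0.8):
    """Pass x through Hw(z) = A(z)/A(z/gamma); a = [1, a1, ..., ap]."""
    a = np.asarray(a, dtype=float)
    a_gamma = a * gamma ** np.arange(len(a))     # coefficients of A(z/gamma)
    return lfilter(a, a_gamma, x)
```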
A zero input response calculating part 39C uses,
as an initial value, a synthesized speech sc^(n-1)(t) obtained
as the output of an all-pole filter 18A (see Fig. 1) of a
transfer characteristic 1/A(z/γ) in the preceding frame
and outputs an initial response when the all-pole filter
18A is excited by a zero input.
A target signal calculating part 39D subtracts
the output of the zero input response calculating part
39C from the output S'w(t) of the frequency weighting
filter 39B to obtain a frequency-weighted signal Sw(t).
On the other hand, the output γ^i·ai of the frequency
weighting filter coefficient processing part 39A is
supplied to an impulse response calculating part 40 in
Fig. 6, in which an impulse response f(t) of a filter
having the transfer characteristic 1/A(z/γ) is
calculated.
A correlation calculating part 41 calculates,
for each impulse position ti, a cross correlation ψ(i)
between the impulse response f(t - ti) and the frequency-
weighted signal Sw(t) as follows:

    ψ(i) = Σ_{t=0}^{N-1} f(t - ti)·Sw(t)                              ... (9)

where i = 1, 2, ..., np, np being the number of impulses
in the frame and N the number of samples in the frame.
Another correlation calculating part 42
calculates a covariance φ(i, j) of the impulse response
for a set of impulse positions ti, tj as follows:

    φ(i, j) = Σ_{t=0}^{N-1} f(t - ti)·f(t - tj)                       ... (10)
An impulse magnitude calculating part 43 obtains
impulse magnitudes mi from ψ(i) and φ(i, j) by solving
the following simultaneous equations, which equivalently
minimize a mean square error between a synthesized speech
waveform obtainable by exciting the all-pole filter 18
with the impulse sequence thus determined and the phase-
equalized speech waveform Sp(t).

    | φ(1, 1)   ...  φ(1, np)  |  | m1  |   | ψ(1)  |
    | φ(2, 1)   ...  φ(2, np)  |  | m2  |   | ψ(2)  |
    |    :              :      |  |  :  | = |   :   |                 ... (11)
    | φ(np, 1)  ...  φ(np, np) |  | mnp |   | ψ(np) |
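A sketch of Eqs. (9)-(11): the cross-correlation vector ψ and covariance matrix Φ are formed from shifted copies of the weighted impulse response f(t) and the normal equations are solved for the magnitudes; the matrix construction below assumes a causal f(t) truncated to the frame.

```python
# Sketch of the impulse magnitude calculation of Eqs. (9)-(11).
import numpy as np

def impulse_magnitudes(f, Sw, positions):
    """Solve Eq. (11) for the impulse magnitudes at the given positions."""
    N = len(Sw)
    F = np.zeros((len(positions), N))      # rows hold f(t - t_i) over the frame
    for i, ti in enumerate(positions):
        n = min(len(f), N - ti)
        F[i, ti:ti + n] = f[:n]
    psi = F @ Sw                    # Eq. (9):  psi(i)   = sum_t f(t-ti) Sw(t)
    phi = F @ F.T                   # Eq. (10): phi(i,j) = sum_t f(t-ti) f(t-tj)
    return np.linalg.solve(phi, psi)        # Eq. (11)
```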
The impulse magnitudes mi are quantized by the quantizer
9 in Fig. 1 for each frame. This is carried out by, for
example, a scalar quantization or vector quantization
method. In the case of employing the vector
quantization technique, a vector (a magnitude pattern)
using respective impulse magnitudes mi as its elements is
compared with a plurality of predetermined standard
impulse magnitude patterns and is quantized to that one
of them which minimizes the distance between the
patterns. A measure of the distance between the
magnitude patterns corresponds essentially to a mean
square error between the speech waveform Sp'(t)
synthesized, without using the zero filter, from the
standard impulse magnitude pattern selected in the
quantizer 9 and the phase-equalized input speech waveform
Sp(t). For example, letting the magnitude pattern vector
obtained by solving Eq. (11) be represented by m = (m1,
m2, ..., mnp) and letting standard pattern vectors stored
as a table in the quantizer 9 be represented by mci (i =
1, 2, ..., Nc), the mean square error is given by the
following equation:
    d(m, mci) = (m - mci)^t·Φ·(m - mci)                               ... (12)

where t represents the transposition of a matrix and Φ
is a matrix using, as its elements, the auto-covariance
φ(i, j) of the impulse response. In this case, the
quantized value m of the above-mentioned magnitude
pattern is expressed by the following equation, as a
standard pattern which minimizes the mean square error
d(m, mci) in Eq. (12) in the afore-mentioned plurality of
standard pattern vectors mci.

    m = arg min d(m, mci)                                             ... (13)
          mci
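The selection of Eqs. (12) and (13) can be sketched as an exhaustive codebook search with the covariance-weighted distance; the codebook itself is a placeholder, not the patent's stored table.

```python
# Sketch of the vector quantization of Eqs. (12)/(13): each stored
# standard magnitude pattern is scored with d(m, mc) = (m-mc)^t Phi (m-mc)
# and the closest pattern is selected.
import numpy as np

def quantize_magnitudes(m, codebook, phi):
    """Return the standard pattern minimizing Eq. (12)."""
    best, best_d = None, np.inf
    for mc in codebook:
        diff = m - mc
        d = float(diff @ phi @ diff)         # Eq. (12)
        if d < best_d:
            best, best_d = mc, d
    return best                              # Eq. (13): arg min over the codebook

# example usage with a placeholder codebook of Nc random patterns:
# codebook = np.random.randn(64, len(m))
```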
The zero filter 10 is to provide an input
impulse sequence with a feature of the phase-equalized
prediction residual waveform, and the coefficients of
this filter are produced by a zero filter coefficient
calculating part 11. Fig. 7A shows an example of the
phase-equalized prediction residual waveform ep(t) and
Fig. 7B an example of an impulse response waveform of the
zero filter 10 for the input impulse thereto. The phase-
equalized prediction residual ep(t) has a flat spectral
envelope characteristic and a phase close to zero, and
hence is impulsive and large in magnitude at impulse
positions ti, ti+1, ... but relatively small at other
positions. The waveform is substantially symmetric with
respect to each impulse position and each midpoint
between adjacent impulse positions, respectively. In
many cases, the magnitude at the midpoint is relatively
larger than at other positions (except for impulse
positions) as will be seen from Fig. 7A, and this
tendency increases for a speech of a long pitch
period, in particular. The zero filter 10 is set so
that its impulse response assumes values at successive q
sample points on either side of the impulse position ti
and at successive r sample points on either side of the
midpoint between the adjacent impulse positions ti and
ti+1, as depicted in Fig. 7B. In this instance, the
transfer characteristic of the zero filter 10 is
expressed as follows:

    V(z) = Σ_{k=-q}^{q} vk·z^-k + Σ_{k=-r}^{r} v(k+q+r+1)·z^-(k+Ti/2)        ... (14)
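A time-domain illustration of the tap layout of Eq. (14) and Fig. 7B: each impulse contributes the taps v-q..vq around its own position and the remaining taps around a midpoint. Treating the midpoint as lying toward the following impulse is an assumption made here for simplicity.

```python
# Sketch of the zero filter applied in the time domain.  v has
# 2q + 2r + 2 taps: v[0..2q] are the main-lobe taps v_{-q}..v_{q},
# v[2q+1..2q+2r+1] are the midpoint taps of Eq. (14).
import numpy as np

def zero_filter_excitation(N, positions, magnitudes, v, q, r):
    """Build the excitation by shaping each impulse with the zero filter."""
    x = np.zeros(N)
    for i, (ti, mi) in enumerate(zip(positions, magnitudes)):
        for k in range(-q, q + 1):                      # main lobe around t_i
            if 0 <= ti + k < N:
                x[ti + k] += mi * v[k + q]
        if i + 1 < len(positions):                      # midpoint lobe
            mid = (ti + positions[i + 1]) // 2
            for k in range(-r, r + 1):
                if 0 <= mid + k < N:
                    x[mid + k] += mi * v[2 * q + r + 1 + k]
    return x
```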
In the zero filter coefficient calculating part
11, for an impulse sequence of given impulse positions
and impulse magnitudes, filter coefficients vk are
determined such that a frequency-weighted mean square
error between the synthesized speech waveform Sp'(t) and
the phase-equalized input speech waveform Sp(t) may be
minimum. Fig. 8 illustrates the construction of the
filter coefficient calculating part 11. A frequency
weighting filter processing part 44 and an impulse
response calculating part 45 are identical in
construction with the frequency weighting filter
processing part 39 and the impulse response calculating
part 40 in Fig. 6, respectively. An adder 46 adds the
output impulse response f(t) of the impulse response
calculating part 45 in accordance with the following
equation:
    uk(t) = Σ_{j=1}^{np} mj·f(t - tj + k)              for |k| ≤ q
    uk(t) = Σ_{j=1}^{np} mj·f(t - tj + k - e + Ti/2)   for |k - e| ≤ r        ... (15)

where e = q + r + 1.
A correlation calculating part 47 calculates the
cross-covariance ψ(i) between the signals Sw(t) and
ui(t), and another correlation calculating part 48
calculates the auto-covariance φ(i, j) between the
signals ui(t) and uj(t). A filter coefficient
calculating part 49 calculates coefficients vi of the
zero filter 10 from the above-said cross correlation
ψ(i) and covariance φ(i, j) by solving the following
simultaneous equations:

    | φ(-q, -q)      ...  φ(-q, q+2r+1)     |  | v-q     |   | ψ(-q)     |
    | φ(-q+1, -q)    ...  φ(-q+1, q+2r+1)   |  | v-q+1   |   | ψ(-q+1)   |
    |      :                    :           |  |    :    | = |     :     |    ... (16)
    | φ(q+2r+1, -q)  ...  φ(q+2r+1, q+2r+1) |  | vq+2r+1 |   | ψ(q+2r+1) |

These solutions eventually minimize a mean square error
between a synthesized speech waveform obtainable by
exciting the all-pole filter 18 with the output of the
zero filter 10 and the phase-equalized speech waveform
Sp(t).
The filter coefficient vi is quantized by a
quantizer 12 in Fig. 1. This is performed by use of a
scalar quantization or vector quantization technique, for
example. In the case of employing the vector
quantization technique, a vector (a coefficient pattern)
using the filter coefficients vi as its elements is
compared with a plurality of predetermined standard
coefficient patterns and is quantized to a standard

-27- ~26~



pattern which minimizes the distance between patterns.
If a measure essentially corresponding to the mean square
error between the synthesized speech waveform Sp'(t) and
the phase-equalized input speech waveform Sp(t) is used
as the measure of distance as in the case of the vector
quantization of the impulse magnitude by the afore-
mentioned quantizer 9, the quantized value v of the
filter coefficients is obtained by the following
equation:

    v = arg min d(v, vci)
          vci

    d(v, vci) = (v - vci)^t·Φ·(v - vci)                               ... (17)

where v is a vector using, as its elements, coefficients
v-q, v-q+1, ..., vq+2r+1 obtained by solving Eq. (16), and
vci is a standard pattern vector of the filter
coefficients. Further, Φ is a matrix using as its
elements the covariance φ(i, j) of the impulse response
ui(t).
To sum up, in the voiced sound frame the speech
signal Sp'(t) is synthesized by exciting an all-pole
filter featuring the speech spectrum envelope
characteristic, with a quasi-periodic impulse sequence
which is determined by impulse positions based on the
phase-equalized residual ep(t) and impulse magnitudes
determined so that an error of the synthesized speech is
minimum. Of the excitation parameters, the impulse
magnitudes mi and the coefficients vi of the zero filter
are set to optimum values which minimize the matching
error between the synthesized speech waveform Sp'(t) and
the phase-equalized speech waveform Sp(t).
Next, excitation in the unvoiced sound frame
will be described. In the unvoiced sound frame a random
pattern is used as an excitation signal as in the case of
code excited linear predictive coding (Schroeder, et al.,
"Code excited linear prediction (CELP)", IEEE Int. On
ASSP, pp 937-940, 1985). A random pattern generating
part 13 in Fig. 1 has stored therein a plurality of
patterns each composed of a plurality of normal random
numbers with a mean 0 and a variance 1. A gain
calculating part 15 calculates, for each random pattern,
a gain g which makes the power of the synthesized
speech Sp'(t) produced by the output random pattern
equal to the power of the phase-equalized speech Sp(t),
and a scalar-quantized gain gi by a quantizer 16 is used to control an
amplifier 14. Next, a matching error between a
synthesized speech waveform Sp'(t) obtained by applying
each of all the random patterns to the all-pole filter 18
and the phase-equalized speech Sp(t) is obtained by the
waveform matching error calculating part 19. The errors
thus obtained are evaluated by the error deciding part 20
and the random pattern generating part 13 searches for an
optimum random pattern which minimizes the waveform
matching error. In this embodiment one frame is composed
of three successive random patterns. This random pattern
sequence is applied as the excitation signal to the all-
pole filter 18 via the amplifier 14.
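The unvoiced-frame search (parts 13 to 20 in Fig. 1) can be sketched as follows; the gain matches the power of the synthesized and phase-equalized speech as described, while the codebook and error measure details are simplified assumptions.

```python
# Sketch of the unvoiced excitation search: each stored random pattern is
# passed through the all-pole filter, a power-matching gain is computed,
# and the pattern with the smallest waveform matching error is kept.
import numpy as np
from scipy.signal import lfilter

def best_random_pattern(codebook, Sp, a):
    """Return (index, gain) of the pattern minimizing the matching error."""
    best = (None, 0.0, np.inf)
    for idx, c in enumerate(codebook):
        synth = lfilter([1.0], a, c)                   # excite 1/A(z) with pattern
        g = np.sqrt(np.sum(Sp ** 2) / (np.sum(synth ** 2) + 1e-12))
        err = np.sum((Sp - g * synth) ** 2)            # waveform matching error
        if err < best[2]:
            best = (idx, g, err)
    return best[0], best[1]
```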
Following the above procedure, the speech signal
is represented by the linear prediction coefficients ai
and the voiced/unvoiced sound parameter W; the voiced
sound is represented by the impulse positions ti, the
impulse magnitudes ~i and zero filter coefficients vi,
and the unvoiced sound is represented by the random
number code pattern (number) ci and the gain gi. These
speech parameters are coded by a coding part 21 and then
transmitted or stored. In a speech synthesizing part the
speech parameters are decoded by a decoding part 22. In
the case of the voiced sound, an impulse sequence
composed of the impulse positions ti and the impulse
magnitudes mi is produced in an impulse sequence
generating part 23 and is applied to a zero filter 24 to
create an excitation signal. In the case of the unvoiced
sound, a random pattern is selectively generated by a
random pattern generating part 25 using the random number
code (signal) ci and is applied to an amplifier 26 which
is controlled by the gain gi and in which it is
magnitude-controlled to produce an excitation signal.
Either one of the excitation signals thus produced is
selected by a switch 27 which is controlled by the
voiced/unvoiced parameter W and the excitation signal
thus selected is applied to an all-pole filter 28 to
excite it, providing a synthesized speech at its output
end 29. The filter coefficients of the zero filter 24
are controlled by vi and the filter coefficients of the
all-pole filter 28 are controlled by ai.
In a first modified form of the above embodiment
the impulse excitation source is used in common to voiced
and unvoiced sounds in the construction of Fig. 1. That
is, the random pattern generating part 13, the amplifier
14, the gain calculating part 15, the quantizer 16 and
the switch 17 are omitted, and the output of the zero
filter 10 is applied directly to the all-pole filter 18.
This somewhat impairs speech quality for fricative
consonants but simplifies the processing structure and
reduces the amount of data to be processed; hence, the
scale of the hardware used may be
small. Moreover, since the voiced/unvoiced sound
parameter need not be transmitted, the bit rate is
reduced by 60 bits per second.
In a second modified form, the zero filter 10 is
not included in the impulse excitation source in Fig. 1,
that is, the zero filter 10, the zero filter coefficient
calculating part 11 and the quantizer 12 are omitted, and
the output of the impulse sequence generating part 7 is
provided via the switch 17 to the all-pole filter 18.
(The zero filter 24 is also omitted accordingly.) With
this method, the naturalness of the synthesized speech is
somewhat degraded for a male voice of low pitch frequency,
but the removal of the zero filter 10 reduces the scale of
the hardware used, and the bit rate is reduced by the
600 bits per second which would otherwise be needed for
coding the filter coefficients.
In a third modified form, processing by the
impulse magnitude calculating part 8 and processing by
the vector quantizing part 9 in Fig. 1 are integrated for
calculating a quantized value of the impulse magnitudes.
Fig. 9 shows the construction of this modified form. A
frequency weighting filter processing part 50, an impulse
response calculating part 51, a correlation calculating
part 52 and another correlation calculating part 53 are
identical in construction with those in Fig. 5. In an
impulse magnitude (vector) quantizing part 52, for each
impulse standard pattern mci (where i = 1, 2, ..., Nc), a
mean square error between a speech waveform synthesized
using the magnitude standard pattern and the phase-
equalized input speech waveform Sp(t) is calculated, and
an impulse magnitude standard pattern is obtained which
minimizes the error. A distance calculation is performed
by the following equation:

    d = mci^T Φ mci - 2 mci^T ψ

where Φ is a matrix using the covariances φ(i, j) of the
impulse response f(t) as its elements and ψ is a column
vector using, as its elements, the cross correlations ψ(i)
(where i = 1, 2, ..., np) of the impulse response and the
output Sw(t) of the frequency weighting filter processing
part 50.
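
A minimal sketch of this quantization step follows, assuming that the covariances and cross correlations are formed from the impulse response shifted to the impulse positions; the helper names are hypothetical and the frequency weighting filter itself is not shown.

```python
import numpy as np

def correlations(f, sw, positions):
    # phi(i, j): covariance of the impulse response shifted to positions t_i, t_j;
    # psi(i): cross correlation of that shifted response with the weighted speech sw.
    n = len(sw)
    shifted = np.zeros((len(positions), n))
    for i, ti in enumerate(positions):
        shifted[i, ti:] = f[:n - ti]
    return shifted @ shifted.T, shifted @ sw

def select_magnitude_pattern(codebook, phi, psi):
    # Evaluate d = m^T phi m - 2 m^T psi for every magnitude standard pattern m_ci
    # and return the index of the pattern minimizing it (equivalent, up to a term
    # independent of m, to minimizing the mean square waveform error).
    best_idx, best_d = -1, np.inf
    for idx, m in enumerate(codebook):
        d = m @ phi @ m - 2.0 * m @ psi
        if d < best_d:
            best_idx, best_d = idx, d
    return best_idx
```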
The structures shown in Figs. 6 and 9 are nearly
equal in the amount of data to be processed for obtaining
the optimum impulse magnitude, but in Fig. 9 processing
for solving the simultaneous equations included in the
processing of Fig. 6 is not required and the processor is
simple-structured accordingly. In Fig. 6, however, the
maximum value of the impulse magnitude can be scalar-
quantized, whereas in Fig. 9 it is premised that the
vector quantization method is used.
It is also possible to calculate quantized
values of coefficients by integrating the calculation of
the coefficients vi of the zero filter 10 and the vector
quantization by the quantizer 12 in the same manner as
mentioned above with respect to Fig. 9.
In a fourth modified form of the Fig. 1
embodiment, the impulse position generating part 6 is not
provided, and consequently, processing shown in Fig. 4 is
not involved, but instead all the reference time points
t'i provided from the phase equalizing-analyzing part 4
are used as impulse positions ti. This somewhat
increases the amount of information necessary for coding
the impulse positions but simplifies the structure and
speeds up the processing. Alternatively, the processing
capacity used for enhancing the quality of the synthesized
speech by means of the zero filter 10 may instead be
assigned to reducing the impulse position information, at
the expense of some speech quality.
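
As a small illustration of this fourth modified form, assuming the reference time points are simply the samples at which the phase-equalized prediction residual exceeds a threshold (the detection rule of the embodiment is not restated here), the points are taken over unchanged as impulse positions:

```python
import numpy as np

def impulse_positions_from_residual(ep, threshold):
    # Fourth modified form: every reference time point t'_i where the
    # phase-equalized residual ep(t) exceeds the threshold is used
    # directly as an impulse position t_i, with no further selection.
    return np.flatnonzero(np.abs(ep) > threshold)
```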
It is evident that in the embodiments of the
speech analysis-synthesis apparatus according to the
present invention, their functional blocks shown may be
formed by hardware and functions of some or all of them
may be performed by a computer.
To evaluate the effect of the speech analysis-synthesis
method according to the present invention, experiments
were conducted under the following conditions. After
sampling a speech in a 0 to 4 kHz band at a sampling
frequency of 8 kHz, the speech signal is multiplied by a
Hamming window of an analysis window 30 ms long and a
linear predictive analysis by the autocorrelation method
is performed with the degree of analysis set to 12, by
which 12 prediction coefficients ai and the
voiced/unvoiced sound parameter are obtained. The
processing of the excitation parameter analyzing part 30
is performed for each 15 ms frame (120 speech samples)
equal to half of the analysis window. The prediction
coefficients are quantized by a differential multiple
stage vector quantizing method. As a distance criterion
in the vector quantization, a frequency weighted cepstrum
distance was used. When the bit rate is 4.8 kb/s, the
number of bits per frame is 72 bits and details are as
follows:

Parameters                                Number of bits/Frame
Prediction coefficients                   24
Voiced/unvoiced sound parameter
Excitation source (for voiced sound)
  Impulse positions                       29
  Impulse magnitudes                       8
  Zero filter coefficients                10
Excitation source (for unvoiced sound)
  Random patterns                         27 (9 x 3)
  Gains                                   18 ((5+1) x 3)
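
The analysis conditions described before the table (8 kHz sampling, a 30 ms Hamming window advanced by a 15 ms frame, 12th-order autocorrelation LPC) can be sketched as follows. This is only an illustration of the front end under those stated conditions: quantization, the voiced/unvoiced decision and the excitation analysis are omitted, and the function names are hypothetical.

```python
import numpy as np

def levinson_durbin(r, order):
    # Solve the autocorrelation normal equations for a_1..a_order in the
    # predictor s(t) ~ sum_k a_k s(t - k).
    a = np.zeros(order + 1)
    e = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] - k * a_prev[i - 1:0:-1]
        e *= 1.0 - k * k
    return a[1:], e

def lpc_frames(speech, fs=8000, window_ms=30, frame_ms=15, order=12):
    # 30 ms Hamming window advanced by 15 ms (half the window) per frame.
    win = int(fs * window_ms / 1000)     # 240 samples
    hop = int(fs * frame_ms / 1000)      # 120 samples
    hamming = np.hamming(win)
    coeffs = []
    for start in range(0, len(speech) - win + 1, hop):
        frame = speech[start:start + win] * hamming
        r = np.array([np.dot(frame[:win - k], frame[k:]) for k in range(order + 1)])
        a, _ = levinson_durbin(r, order)
        coeffs.append(a)
    return np.array(coeffs)

# coeffs = lpc_frames(speech_samples)  # one row of 12 coefficients per 15 ms frame
```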

The constant J representing the allowed limit of
fluctuations in the impulse frequency in the impulse
source, the allowed maximum number of impulses per frame,
Np, and the allowed minimum value of impulse intervals,
Lmin, are dependent on the number of bits assigned for
coding of the impulse positions. In the case of coding
the impulse positions at the rate of 29 bits/frame, it is
preferable, for example, that the difference between
adjacent impulse intervals, ΔT, be equal to or smaller
than 5 samples, the maximum number of impulses, Np, be
equal to or smaller than 6, and the allowed minimum
impulse interval Lmin be equal to or greater than
13 samples. A filter of degree 7 (q = r = 1) was used as
the zero filter 10. The random pattern vector ci is
composed of 40 samples (5 ms) and is selected from 512
kinds of patterns (9 bits). The gain gi is scalar-
quantized using 6 bits including a sign bit.
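
A simple sketch of the constraints just listed, as a validity check on a frame's impulse positions, is given below; the actual 29-bit position coding format is not described in this excerpt, so only the constraints themselves are expressed, and the function name is illustrative.

```python
def positions_satisfy_constraints(positions, max_pulses=6, min_interval=13,
                                  max_interval_diff=5):
    # At most Np = 6 impulses, every interval at least Lmin = 13 samples,
    # and adjacent intervals differing by no more than 5 samples.
    if len(positions) > max_pulses:
        return False
    intervals = [b - a for a, b in zip(positions, positions[1:])]
    if any(iv < min_interval for iv in intervals):
        return False
    return all(abs(b - a) <= max_interval_diff
               for a, b in zip(intervals, intervals[1:]))

# Example: intervals of 15, 17 and 14 samples satisfy all three constraints.
assert positions_satisfy_constraints([0, 15, 32, 46])
```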
The speech coded under the above conditions is
far more natural sounding than the speech produced by the
conventional
vocoder and its quality is close to that of the original
speech. Further, the dependence of speech quality on the
speaker in the present invention is lower than in the
case of the prior art vocoder. It has been ascertalned
that the quality of the coded speech is apparently higher
than in the cases of the conventional multipulse
predictive coding and the code excited predictive coding.
A spectral envelope error of a speech coded at 4.8 kb/s
is about 1 dB. A coding delay of this invention i9 45
ms, which is equal to or shorter than that of the
conventional low-bit rate speech coding schemes.
A short Japanese sentence uttered by two men and
two women was speech-analyzed using substantially the
same conditions as those mentioned above to obtain the
excitation parameters, the prediction coefficients and
the voiced/unvoiced parameter W, which were then used to
synthesize a speech, and an opinion test for the
subjective quality evaluation of the synthesized speech
was conducted by 30 persons. In Fig. 10 the results of
the test are shown in comparison with those in the cases
of other coding methods. The abscissa represents the MOS
(Mean Opinion Score), and ORG denotes the original speech.
PCM4 to PCM8 represent synthesized speeches by 4- to 8-bit
Log-PCM coding methods, and EQ indicates a phase-
equalized speech. The test results demonstrate that the
coding by the present invention is performed at a low bit
rate of 4.8 kb/s but provides a high quality synthesized
speech equal in quality to the synthesized speech by the
8-bit Log-PCM coding.
According to the present invention, by
expressing the excitation signal for a voiced sound as a
quasi-periodic impulse sequence, the reproducibility of
speech waveform information is higher than in the
conventional vocoder and the excitation signal can be
expressed with a smaller amount of information than in
the conventional multipulse predictive coding. Moreover,
since an error between the input speech waveform and the
phase-equalized speech waveform is used as the criterion
for estimating the parameters of the excitation signal
from the input speech, the present invention enhances
matching between the synthesized speech waveform and the
input speech waveform as compared with the prior art
utilizing an error between the input speech itself and
the synthesized speech, and hence permits an accurate
estimation of the excitation parameters. Besides, the
zero filter produces the effect of reproducing fine
spectral characteristics of the original speech, thereby
making the synthesized speech more natural sounding.
It will be apparent that many modifications and
variations may be effected without departing from the
scope of the novel concepts of the present invention.






Administrative Status


Title Date
Forecasted Issue Date 1996-07-09
(22) Filed 1990-10-01
Examination Requested 1990-10-01
(41) Open to Public Inspection 1991-04-03
(45) Issued 1996-07-09
Deemed Expired 2009-10-01

Abandonment History

There is no abandonment history.

Payment History

Fee Type Anniversary Year Due Date Amount Paid Paid Date
Application Fee $0.00 1990-10-01
Registration of a document - section 124 $0.00 1991-05-03
Maintenance Fee - Application - New Act 2 1992-10-01 $100.00 1992-08-27
Maintenance Fee - Application - New Act 3 1993-10-01 $100.00 1993-07-21
Maintenance Fee - Application - New Act 4 1994-10-03 $100.00 1994-08-02
Maintenance Fee - Application - New Act 5 1995-10-02 $150.00 1995-07-19
Maintenance Fee - Patent - New Act 6 1996-10-01 $150.00 1996-08-13
Maintenance Fee - Patent - New Act 7 1997-10-01 $150.00 1997-07-31
Maintenance Fee - Patent - New Act 8 1998-10-01 $150.00 1998-07-29
Maintenance Fee - Patent - New Act 9 1999-10-01 $150.00 1999-07-29
Maintenance Fee - Patent - New Act 10 2000-10-02 $200.00 2000-08-03
Maintenance Fee - Patent - New Act 11 2001-10-01 $200.00 2001-08-09
Maintenance Fee - Patent - New Act 12 2002-10-01 $200.00 2002-07-17
Maintenance Fee - Patent - New Act 13 2003-10-01 $200.00 2003-08-20
Maintenance Fee - Patent - New Act 14 2004-10-01 $250.00 2004-09-09
Maintenance Fee - Patent - New Act 15 2005-10-03 $450.00 2005-09-09
Maintenance Fee - Patent - New Act 16 2006-10-02 $450.00 2006-09-13
Maintenance Fee - Patent - New Act 17 2007-10-01 $450.00 2007-07-16
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
NIPPON TELEGRAPH & TELEPHONE CORPORATION
Past Owners on Record
HONDA, MASAAKI
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents



Document Description    Date (yyyy-mm-dd)    Number of pages    Size of Image (KB)
Description 1994-03-26 35 1,400
Description 1996-07-09 35 1,383
Drawings 1996-07-09 8 154
Cover Page 1994-03-26 1 16
Abstract 1994-03-26 1 28
Claims 1994-03-26 7 255
Drawings 1994-03-26 8 175
Cover Page 1996-07-09 1 16
Abstract 1996-07-09 1 28
Claims 1996-07-09 6 242
PCT Correspondence 1996-05-06 1 50
Prosecution Correspondence 1995-11-15 1 48
Prosecution Correspondence 1994-12-29 1 37
Prosecution Correspondence 1994-12-09 5 245
Prosecution Correspondence 1993-03-23 3 130
Office Letter 1991-03-12 1 29
Examiner Requisition 1995-07-17 1 62
Examiner Requisition 1994-08-11 2 95
Examiner Requisition 1992-12-21 1 69
Fees 1996-08-13 1 62
Fees 1995-07-19 1 68
Fees 1994-08-02 1 64
Fees 1993-07-21 1 44
Fees 1992-08-27 1 33