Patent 1180813 Summary

(12) Patent:	(11) CA 1180813
(21) Application Number:	268804
(54) English Title:	SPEECH RECOGNITION APPARATUS
(54) French Title:	APPAREIL DE RECONNAISSANCE DE LA PAROLE
Status:	Expired

Bibliographic Data

(52) Canadian Patent Classification (CPC):	354/54
(51) International Patent Classification (IPC):	G10L 15/08 (2006.01) G10L 15/10 (2006.01)
(72) Inventors :	MOSHIER, STEPHEN L. (United States of America)
(73) Owners :	EXXON CORPORATION (Not Available)
(71) Applicants :
(74) Agent:	GEORGE H. RICHES AND ASSOCIATES
(74) Associate agent:
(45) Issued:	1985-01-08
(22) Filed Date:	1976-12-29
Availability of licence:	N/A
(25) Language of filing:	English

Patent Cooperation Treaty (PCT):	No

(30) Application Priority Data:	None

Abstracts

English Abstract

ABSTRACT OF THE DISCLOSURE
In the speech recognition apparatus disclosed herein,
an audio signal is digitized and a succession of short-term
power spectra are generated over a time interval corresponding
to a spoken word. The short-term power spectra are frequency
band equalized as a function of the peak amplitude occurring in
each frequency band over the word interval. The changes in
amplitude in each frequency band are weighted and summed to
obtain a cumulative measure of subjective time and then a
limited number of frequency band equalized spectra are selected
as representing equal intervals of subjective time so as to
suppress variations in rate of articulation. The selected
spectra are then non-linearly scaled in amplitude and
transformed so as to maximize the separation between phonetically
different sounds. By means of a maximum-likelihood method, the
transformed selected spectra are compared with a data base
representing a vocabulary to be recognized.

Claims

Note: Claims are shown in the official language in which they were submitted.

The embodiments of the invention in which an exclusive
property or privilege is claimed are defined as follows:

1. In a speech analysis system in which an audio signal
is analyzed over an interval corresponding to a spoken word to
determine the behavior of formant resonances relative to a
sequence of reference vectors representing a preselected word,
a method of selecting sample points within said interval
comprising:
repeatedly over said interval, evaluating a set of
parameters corresponding to the energy spectrum of said signal
at that time, each such set of values being characterizable as
a vector having a coordinate corresponding to each parameter;
summing over the said set of parameters the magnitudes
of the values of the changes that occur between successive
evaluations of each parameter, thereby to obtain a value
corresponding to the arc length increment traversed by the
multi-coordinate vector during the subinterval between successive
evaluations;
accumulating the arc length increments over successive
subintervals so as to obtain a sequence of arc lengths through-
out the said interval and a total arc length for the said
interval;
dividing the total arc length into a sequence of equal
length segments corresponding in number to the number of vectors
in the sequence of reference vectors;
separating said sequence of arc lengths into groups,
the cumulative arc length for each group being substantially
equal to said equal length segments; and
for each segment, selecting a set of parameter values
defining a representative vector from the vectors associated
with the corresponding group of arc lengths and comparing the
selected set with the parameter values defining the corresponding
recognition vector, the several comparisons so performed being
indicative of the match between the audio signal and the speech
corresponding to the recognition vectors.

2. A speech analysis system as set forth in claim 1
wherein the magnitudes of the changes that occur between
successive evaluations of each parameter are multiplied by a
respective predetermined weighting factor prior to summing the
set of parameters, thereby to emphasize the importance of changes
in certain of the parameters and to de-emphasize changes in
other parameters.

3. In a speech analysis system in which an audio signal
is analyzed over an interval corresponding to a spoken word to
determine the behavior of formant resonances relative to a
sequence of reference vectors representing a preselected word,
a method of obtaining and selecting sample points within said
interval comprising:
repeatedly within said interval, evaluating a set of
parameters determining the short-term power spectrum of said
audio signal in a subinterval within the said interval, thereby
to generate a sequence of short-term power spectra;
for each parameter in the set, determining the maximum
value of the parameter occurring over the interval, the set of
maximum values thereby determined corresponding to a peak
spectrum over the interval;
smoothing the peak spectrum by averaging each maximum
value with values from said set of maximum values corresponding
to adjacent frequencies, the width of the band of frequencies
contributing to each averaged value being approximately equal to
the typical frequency separation between formant frequencies;
for each short-term power spectrum in said sequence
of spectra, dividing the value for each parameter in the set by
the corresponding smoothed maximum value in the smoothed peak
spectrum, thereby to generate over said interval a sequence of
frequency band equalized spectra corresponding to a compensated
audio signal having the same maximum short-term energy content
in each of the frequency bands comprising the spectrum, each
such set of equalized parameters being characterizable as a
vector having a coordinate corresponding to each parameter;
summing over the said set of equalized parameters the
magnitudes of the values of the changes that occur between
successive evaluations of each equalized parameter, thereby to
obtain a value corresponding to the arc length increment
traversed by the multi-coordinate vector during the subinterval
between successive evaluations;
accumulating the arc length increments over successive
subintervals so as to obtain a sequence of arc lengths through-
out the said interval and a total arc length for the said
interval;
dividing the total arc length into a sequence of equal
length segments corresponding in number to the number of vectors
in the sequence of reference vectors;
separating said sequences of arc lengths into groups,
the cumulative arc length for each group being substantially
equal to said equal length segments; and
for each segment, selecting a set of equalized parame-
ter values defining a representative vector from the vectors
associated with the corresponding group of arc lengths and
comparing the selected set with the parameter values defining
the corresponding reference vector, the several comparisons
so performed being indicative of the match between the audio
signal and the speech corresponding to the reference vectors.

4. In a speech analysis system in which an audio signal
is spectrum analyzed to determine the behavior of formant
resonances over an interval of time, a frequency compensation
and amplitude scaling method comprising:
repeatedly within said interval, evaluating a set of
parameters determining the short-term power spectrum of said
audio signal in a subinterval within the said interval, thereby
to generate a sequence of short-term power spectra;
for each parameter in the set, determining the maximum
value of the parameter occurring over the interval, the set of
maximum values thereby determined corresponding to a peak
spectrum over the interval;
smoothing the peak spectrum by averaging each maximum
value with values from the set of maximum values corresponding
to adjacent frequencies, the width of the band of frequencies
contributing to each averaged value being approximately equal
to the typical frequency separation between formant frequencies,
for each short-term power spectrum in said sequence
of spectra, dividing the value for each parameter in the set by
the corresponding smoothed maximum value in the smoothed peak
spectrum, thereby to generate for each spectrum, a corresponding
frequency band equalized spectrum comprising a set of equalized
parameters S(f);
generating a value A corresponding to the average of
said set of N values, where

Image

and Fo represents the width of each frequency band; and
non-linearly scaling each spectrum by generating, for
each value S(f) in each frequency band equalized spectrum, a
corresponding value Ss(f), where

Image

5. In a speech recognition system, a method of comparing
the spectrum of an audio signal representing speech with a vector
of recognition coefficients (ai,bi,c), said method comprising:
generating a set of parameters S(f) corresponding to
the short-term power spectrum of said signal, each parameter
representing the energy in a corresponding frequency band f;
generating a value A corresponding to the average of
said set of M parameters, where

Image

and Fo represents the width of each frequency band; for each
parameter in said set, generating a corresponding non-linearly
scaled value Ss(f), where

Image

generating from these values a set of linearly scaled
values Lk, where

Image

where the constant coefficients Pjk enhance the
phonetic attributes of the processed speech and are independent
of the particular speech patterns represented by the coefficients
(ai,bi,c), and M equals the number of possible decision choices,
and
generating a numerical comparison value X, where

Image

the comparison value being indicative of the match
between the audio signal and the speech represented by the
recognition coefficients.

6. A speech recognition system as set forth in claim 5
wherein the set of parameters S(f) is generated repeatedly over
an interval corresponding to at least one spoken word, each such
set of parameters being characterizable as a vector having a
coordinate corresponding to each parameter;
summing over the set of parameters the magnitudes of
the values of the changes that occur between successive evalua-
tions of each parameter, thereby to obtain a value corresponding
to the arc length increment traversed by the multi-coordinate
vector during the subinterval between successive evaluations;
accumulating the arc length increments over successive
subintervals so as to obtain a sequence of arc lengths through-
out the said interval and a total arc length for the said
interval;
dividing the total arc length into a sequence of equal
length segments corresponding in number to the number of vectors
in the sequence of reference vectors;
separating said sequence of arc lengths into groups
the cumulative arc length for each group being substantially
equal to said equal length segments; and
for each segment, selecting said set of parameter
values S(f) defining a representative vector from the vectors
associated with the corresponding group of arc lengths and
comparing the selected set with the parameter values defining
the corresponding recognition vector, the several comparisons
so performed being indicative of the match between the audio
signal and the speech corresponding to the recognition vectors.

7. A speech recognition system as set forth in claim 6
wherein the set of parameters S(f) is generated repeatedly
within an interval corresponding to at least one spoken word;
for each parameter in the set, determining the maximum
occurring over the interval, the set of maximum values thereby
determined corresponding to a peak spectrum over the interval;
smoothing the peak spectrum by averaging each maximum
value with values corresponding to adjacent frequencies, the
width of the band of frequencies contributing to each averaged
value being approximately equal to the normal frequency
separation between formant frequencies; and
for each set S(f), dividing each parameter therein by
the corresponding smoothed maximum value in the smoothed peak
spectrum, thereby to generate a set of frequency band equalized
spectra corresponding to a frequency compensated audio signal
over said interval.

8. A speech recognition system as set forth in claim 5
wherein the set of values S(f) is generated repeatedly within
an interval corresponding to at least one spoken word;
for each value in the set, determining the maximum
occurring over the interval, the set of maximum values thereby
determined corresponding to a peak spectrum over the interval;
smoothing the peak spectrum by averaging each peak
value with the values corresponding to adjacent frequencies,
the width of the band of frequencies contributing to each
averaged value being approximately equal to the normal frequency
separation between formant frequencies; and
for each set S(f), dividing each value therein by the
corresponding value in the smoothed peak spectrum, thereby to
generate a set of frequency equalized spectra corresponding to
the energy content of said audio signal over said interval, the
values in the equalized spectra being utilized to generate the
non-linearly scaled values Ss(f).

Description

Note: Descriptions are shown in the official language in which they were submitted.

Background of the Invention

The present invention relates to speech recognition
apparatus and more particularly to such apparatus in which
sequentially generated spectra are equalized and selected to
improve accuracy of recognition upon comparison with data
representing a vocabulary to be recognized.
Various speech recognition systems have been proposed
heretofore including those which attempt to recognize phonemes
and attempt to recognize and determine the pattern of behavior
of formant frequencies within speech. While these prior art
techniques have achieved various measures of success,
substantial problems exist. For example, the vocabularies which
can be recognized are limited; the recognition accuracy is
highly sensitive to differences between the voice characteristics
of different talkers; and the systems have been highly sensitive
to distortion in the speech signal being analyzed. This latter
problem has typically precluded the use of such automatic
speech recognition systems on speech signals transmitted over
ordinary telephone apparatus, even though such signals were
easily capable of being recognized and understood by a human
observer.
Among the objects of the present invention may be
noted the provision of speech recognition apparatus providing
improved accuracy of recognition; the provision of such
apparatus which is relatively insensitive to frequency
distortion of the speech signal to be recognized; the provision
of such a system which is relatively insensitive to variations
in speaking rate in the signal to be analyzed; the provision of
such a system which will respond to different voices; and the
provision of such apparatus which is of highly reliable and
relatively simple and inexpensive construction. Other objects
and features will be in part apparent and in part pointed out
hereinafter.
Summary of the Invention

The speech recognition system of the present invention
spectrum analyzes an audio signal to determine the behavior of
formant frequencies over an interval of time corresponding to
a spoken word or phrase. Repeatedly within the interval, a
short-term power spectrum is generated representing the
amplitude or power spectrum of the audio signal in a brief
subinterval. For each frequency band in the short-term spectra,
the maximum value occurring over the interval is determined,
thereby obtaining a peak power spectrum over the interval.
This peak spectrum is smoothed by averaging each maximum value
with values corresponding to adjacent frequency bands, the
width of the overall band contributing to each average being
approximately equal to the typical frequency separation between
formant frequencies (about 1000 Hz). For each of the originally
obtained sequence of short-term power spectra, the amplitude
value of each frequency band is divided by the corresponding
value in the smoothed peak spectrum, thereby generating a
corresponding sequence of frequency-equalized spectra.
Comparison of a selected group of these frequency band equalized
spectra with a data base identifying a known vocabulary provides
improved recognition when the original speech signal has been
subject to frequency distortion, e.g. by a telephone line
transmission.
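
The equalization just summarized can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program of the preferred embodiment: the function name equalize_spectra, the default averaging width of 7 bands (roughly 1000 Hz if 32 bands are taken to span 0 to 5000 Hz), and the shrinking of the averaging window at the band edges are all hypothetical choices.

```python
def equalize_spectra(spectra, width_bands=7):
    """Frequency band equalization of a sequence of short-term power spectra.

    spectra: list of frames, each a list of non-negative band values.
    width_bands: number of adjacent bands averaged when smoothing the peak
    spectrum (assumed value; the text only says "about 1000 Hz").
    """
    n_bands = len(spectra[0])
    # Peak spectrum: maximum value seen in each band over the whole interval.
    peak = [max(frame[b] for frame in spectra) for b in range(n_bands)]
    # Smooth the peak spectrum with a moving average over adjacent bands.
    # Near the edges the window simply shrinks (an assumption).
    half = width_bands // 2
    smoothed = []
    for b in range(n_bands):
        lo, hi = max(0, b - half), min(n_bands, b + half + 1)
        smoothed.append(sum(peak[lo:hi]) / (hi - lo))
    # Divide every band of every frame by the smoothed peak for that band.
    return [[v / s if s > 0 else 0.0 for v, s in zip(frame, smoothed)]
            for frame in spectra]
```

Because every frame is divided by a quantity derived from the same signal, multiplying the whole input by a constant gain leaves the equalized spectra essentially unchanged, which is the kind of insensitivity to channel distortion the summary describes.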
Brief Description of the Drawings

Fig. 1 is a flow chart illustrating the general
sequence of operations performed in accordance with the practice
of the present invention;
Fig. 2 is a schematic block diagram of electronic
apparatus performing certain initial operations in the overall
process illustrated in Fig. 1; and
Fig. 3 is a flow diagram of a digital computer program
performing certain subsequent procedures in the process of
Fig. 1.
Corresponding reference characters indicate corresponding
parts throughout the several views of the drawings.

Description of the Preferred Embodiment
In the particular preferred embodiment which is
described herein, speech recognition is performed by an overall
apparatus which involves both a specially constructed electronic
system for effecting certain analog and digital processing of
incoming speech signals and also a general-purpose digital
computer which is programmed in accordance with the present
invention to effect certain data reduction steps and numerical
evaluations. The division of tasks between the hardware portion
and the software portion of this system has been made so as to
obtain an overall system which can accomplish speech recognition
in real time at moderate cost. However, it should be understood
that some of the tasks being performed in hardware in this
particular system could as well be performed in software and
that some of the tasks being performed by software programming
in this example might also be performed by special-purpose
circuitry in a different embodiment of the invention.
The successive operations performed by the present
system in recognizing speech signals are illustrated in a
general way in Fig. 1. It is useful, in connection with this
initial description, to also follow through the various data
rates which are involved so as to facilitate an understanding of
the detailed operation described hereinafter. As indicated
previously, one aspect of the present invention is the provision
of apparatus which will recognize speech signals, even though
those signals are frequency distorted, e.g. as by a telephone
line. Thus, in Fig. 1, the voice input signal, indicated at
11, may be considered to be a voice signal received over a
telephone line encompassing any arbitrary distance or number of
switching interchanges.
As will become apparent in the course of the following
description, the present method and apparatus are concerned
with the recognition of speech segments containing a sequence
of sounds or phonemes. In the following description and in the
claims, reference is made to "an interval corresponding to a
spoken word" since this is a convenient way of expressing a
minimum length of time appropriate for encompassing a recognizable
sequence of sounds. This term, however, should be broadly and
generically construed so as to encompass a series of words in
the grammatical sense, as well as a single word.
In the particular implementation illustrated, the
interval corresponding to a spoken word is taken somewhat
arbitrarily as a one second interval. Various techniques are
known in the art for determining when to start or initiate this
interval. In general, the particular technique used forms no
part of the present invention. However, it is at present
preferred that the interval be initiated when the input
signal power, calculated as described hereinafter, exceeds a
preset threshold more than half the time in a sliding window
of about 30 consecutively generated spectra of the voice signal,
digitized as described hereinafter.
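
A minimal Python sketch of this start criterion follows. It assumes one power value per generated spectrum; the function name find_word_start, and the choice to report the index of the frame at which the condition first holds (rather than, say, the start of the window), are illustrative assumptions that the text does not settle.

```python
from collections import deque


def find_word_start(frame_powers, threshold, window=30):
    """Return the index of the first frame at which more than half of the
    last `window` frames exceeded `threshold`, or None if that never happens.
    """
    recent = deque(maxlen=window)  # sliding window of above-threshold flags
    for i, power in enumerate(frame_powers):
        recent.append(power > threshold)
        if len(recent) == window and sum(recent) > window / 2:
            return i
    return None
```

With 40 quiet frames followed by loud ones and a 30-frame window, the criterion is first met once 16 loud frames are inside the window.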
After being amplitude normalized by an analog a.g.c.
circuit, the voice signal is digitized, that is, the signal
amplitude is converted to digital form. In the present example,
an 8-bit binary representation of the signal amplitude is
generated at a rate of 10,000 conversions per second. An
autocorrelator 17 processes this input signal to generate an
autocorrelation function 100 times per second, as indicated at
19. Each autocorrelation function comprises 32 values or
channels, each value being calculated to a 24-bit resolution.
The autocorrelator is described in greater detail hereinafter
with reference to Fig. 2.

The autocorrelation functions 19 are subjected to
Fourier transformation, as indicated at 21, so as to obtain
corresponding power spectra 23. These "original" spectra are
calculated at the same repetition rate as the autocorrelation
functions, i.e. 32 channels each having a resolution of 16 bits.
As will be understood, each of the 32 channels in each spectrum
represents a frequency band. In the present embodiment, the
Fourier transformation, as well as subsequent processing steps,
are performed under the control of a general-purpose digital
computer, appropriately programmed, utilizing a peripheral
array processor for speeding the arithmetic operations required
repeatedly in the present method. The particular computer
employed is a model PDP11* manufactured by the Digital Equipment
Corporation of Maynard, Massachusetts, and the programming
described hereinafter with reference to Fig. 3 is substantially
predicated upon the capabilities and characteristics of that
commercially available computer.
* Trade Mark

Each of the successive short-term power spectra is
frequency band equalized, as indicated at 25, this equalization
being performed as a function of the peak amplitude occurring in
each frequency band over the interval as described in greater
detail hereinafter. Again, the equalized spectra, designated 26,
are generated at the rate of 100 per second, each frequency band
equalized spectrum having 32 channels evaluated to 16-bit binary
accuracy.
In order to compensate for differences in speaking
rate, the system then performs a redistribution or compensation
predicated on the passage of subjective time. While this
compensation is described in greater detail hereinafter, it may
for the present be noted that this evaluation consists
essentially of the accumulation of the magnitude of all amplitude
changes in all of the different frequency channels over the
interval of interest. This accumulation is performed at 29.
Since the recognition of speech to some extent depends upon the
way in which formant frequencies shift, it can be seen that the
rate of shift is indicative of the speaking rate. Further, such
shifts will be reflected in changes in the amplitudes in the
frequency channels involved.
The subjective time evaluation provides a basis for
selection of a limited number of the frequency band equalized
spectra within the interval, which selected samples are fairly
representative of the spoken word. As indicated previously, the
short-term power spectra themselves are generated at the rate of
100 per second. As will be understood, however, much of the
data is redundant. In the practice of the present invention, it
has been found that 12 of the frequency band equalized spectra
provide an adequate representation of a short word or sequence
of phonemes, appropriate for recognition purposes. The
subjective time evaluation is therefore employed to divide the
entire interval (approximately one second) into 12 periods of
equal subjective time value and to select a corresponding
short-term power spectrum for each such period, the selection
being performed at 31.
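
The subjective time selection can be illustrated with the following Python sketch. It accumulates the (optionally weighted) magnitudes of band-to-band amplitude changes, divides the total into equal segments, and picks one frame per segment. The function name select_by_subjective_time and the rule of taking the first frame that reaches the midpoint of each segment are assumptions made for illustration; the text only requires a representative spectrum for each period.

```python
from bisect import bisect_left
from itertools import accumulate


def select_by_subjective_time(spectra, n_select=12, weights=None):
    """Return indices of n_select frames spaced evenly in subjective time.

    spectra: list of frames, each a list of band values.
    weights: optional per-band weighting factors (defaults to 1 per band).
    """
    n_bands = len(spectra[0])
    w = weights if weights is not None else [1.0] * n_bands
    # Subjective time increment between successive frames: the weighted sum
    # of absolute amplitude changes over all bands.
    increments = [
        sum(wb * abs(cur[b] - prev[b]) for b, wb in enumerate(w))
        for prev, cur in zip(spectra, spectra[1:])
    ]
    arc = [0.0] + list(accumulate(increments))  # cumulative value per frame
    total = arc[-1]
    indices = []
    for m in range(n_select):
        target = (m + 0.5) * total / n_select   # midpoint of segment m
        indices.append(min(bisect_left(arc, target), len(spectra) - 1))
    return indices
```

In this sketch, a word spoken at half speed (every frame repeated twice) yields the same selected spectra as the original, since repeated frames add nothing to the accumulated change.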

In order to facilitate the final evaluation of the
spoken word, the amplitude values of the selected spectra are
subjected to a non-linear scalar transformation, as indicated
at 35. This transformation is described in greater detail
hereinafter but it may be noted at this point that this
transformation improves the accuracy with which an unknown speech
signal may be matched with a reference vocabulary. In the
embodiment illustrated, this transformation is performed on all
of the frequency band equalized spectra, in parallel with the
accumulation which evaluates subjective time and prior to the
selection of representative samples. The actual comparison
of the selected spectra with the data base is performed after a
vector transformation, indicated at 37, the product of the vector
transformation being applied to a likelihood evaluator indicated
at 41.
Preprocessor

In the apparatus illustrated in Fig. 2, an autocorrelation
function and an averaging function are performed digitally
on a data stream generated by the analog-to-digital converter
13 which digitizes the analog voice signal 11. The digital
processing functions, as well as the input analog-to-digital
conversion, are timed under the control of a clock oscillator
51. Clock oscillator 51 provides a basic timing signal at
320,000 pulses per second and this signal is applied to a
frequency divider 52 so as to obtain a second timing signal
at 10,000 pulses per second. This slower timing signal controls
the analog-to-digital converter 13 together with a latch 53
which holds the 8-bit results of the last conversion until the
next conversion is completed. Prior to being applied to the
latch, the digital value is converted to
a sign magnitude representation, as indicated at 54, from the
usual representation provided by conventional analog-to-digital
converters, such as that indicated at 13.
The autocorrelation products desired are generated by
a digital multiplier 56 together with a 32 word shift register
58 and appropriate control circuitry. The shift register 58
is operated in a recirculating mode and is driven by the faster
clock frequency so that one complete circulation of data is
accomplished for each analog-to-digital conversion. One input
to the digital multiplier 56 is taken from the latch 53 while
the other input to the multiplier is taken from the current
output of the shift register, the multiplications being
performed at the higher clock frequency. Thus, each value obtained
from the conversion is multiplied with each of the preceding
31 conversion values. As will be understood by those skilled
in the art, the signals thereby generated are equivalent to
multiplying the input signal by itself delayed in time by 32
different time increments. To produce the zero-delay
correlation (i.e. the power), a multiplexer 59 causes the current
value to be multiplied by itself at the time each new value
is being introduced into the shift register, this timing
function being indicated at 60.
As will also be understood by those skilled in the art,
the products from a single conversion together with its 31
predecessors will not be fairly representative of the energy
distribution or spectrum of the signal over any reasonable
sampling interval. Accordingly, the apparatus of Fig. 2 provides
for the averaging of these sets of products.
To facilitate the additive process of averaging, the
sign/magnitude/binary representation of the individual
autocorrelation products generated by multiplier 56 is converted to
a two's-complement code as indicated at 61. The accumulation
process which effects averaging is provided by a 32-word shift
register 63 which is interconnected with an adder 65 so as to
form a set of 32 accumulators. Thus, each word can be recirculated
after having added to it the corresponding increment from the
digital multiplier. The circulation loop passes through a
gate 67 which is controlled by a divider circuit 69 driven by
the lower frequency clock signal. The divider 69 divides the
lower frequency clock signal by a factor which determines the
number of instantaneous autocorrelation functions which are to
be accumulated or averaged before the shift register 63 is read
out.
In the preferred example, it is assumed that 100
samples are accumulated before being read out. In other words,
N for the divide-by-N divider is one hundred. After 100 samples
have thus been transformed and accumulated, the timing circuit
69 triggers a computer interrupt circuit 71. At this time,
the contents of the shift register 63 are read into the
computer's memory through suitable interface circuitry 73, the
32 successive words in the register being presented successively
to the interface. As will be understood by those skilled in the
art, this reading in of data may be typically performed by a
direct memory access procedure. Predicated on the averaging
of 100 samples, and an initial sampling rate of 10,000 per
second, it will be seen that 100 averaged autocorrelation
functions will be provided to the computer every second. While
the shift register contents are being read out to the computer,
the gate 67 is closed so that each of the words in the shift
register is effectively reset back to zero to permit the
accumulation to begin again.
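
The behavior of the multiplier, shift register and accumulators described above can be modeled in software roughly as follows. This is a sketch of the data flow only, not of the circuit timing; the function name averaged_autocorrelations and the use of Python lists in place of the registers are illustrative assumptions.

```python
def averaged_autocorrelations(samples, n_lags=32, block=100):
    """Accumulate lagged products of a sample stream in blocks.

    For each new sample x, the product of x with itself (lag 0) and with
    each of the preceding n_lags - 1 samples is added to a bank of
    accumulators. After `block` samples the accumulators are read out
    and reset, giving one autocorrelation function per block.
    """
    history = [0] * n_lags   # history[j] is the sample j conversions back
    acc = [0] * n_lags       # one accumulator per lag
    blocks = []
    for n, x in enumerate(samples, start=1):
        history = [x] + history[:-1]      # shift the new sample in
        for j in range(n_lags):
            acc[j] += x * history[j]      # lag-j product
        if n % block == 0:
            blocks.append(acc)            # read out to the computer
            acc = [0] * n_lags            # reset for the next block
    return blocks
```

At 10,000 samples per second and 100 samples per block, one second of input produces 100 blocks of 32 values, matching the data rate stated above.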

Expressed in mathematical terms, the operation of the
apparatus shown in Fig. 2 may be described as follows. Assuming
the analog-to-digital converter generates the time series S(t),
S(t-T), S(t-2T)..., the digital correlator circuitry of Fig. 2
may be considered to compute the autocorrelation function

$$\psi(j,t) = \sum_{k=1}^{100} S(t-kT)\, S(t-[k-j]T)$$

After an interval corresponding to a spoken word, the
digital correlator will have transferred to the computer a
series of data blocks representing the spoken word. Assuming
that the interval of interest is in the order of one second,
there will be 100 blocks of data, each comprising 32 words of
24 bits each. Further, each block of data represents an
autocorrelation function derived from a corresponding
subinterval of the overall interval under consideration. In the
embodiment illustrated, the processing of this data from this
point on in the system is performed by a general-purpose digital
computer, appropriately programmed. The flow chart which
includes the function provided by the computer program is given
in Fig. 3. Again, however, it should be pointed out that
various of these steps could also be performed by hardware rather
than software and that, likewise, certain of the functions
performed by the apparatus of Fig. 2 could additionally be
performed in the software by corresponding revision of the
flow chart of Fig. 3.
Although the digital correlator of Fig. 2 performs some
time averaging of the autocorrelation functions generated on
an instantaneous basis, the averaged autocorrelation functions
read out to the computer may still contain some anomalous
discontinuities or unevennesses which might interfere with

-- 10 --

1 orderly processing and evaluation of the samples. ~ccordinyly,
each block of data is first smoothed with respect to time, i.e.
with respect to adjacent channels defining the function, which
channels correspond to successive delay periods. This is
indicated in the flow chart of Fig. 3 at 79. The preferred
smoothing process is a two-pole convolutional procedure in
which the smoothed output ~s(j,t) is yiven by
~S~i~t)=co~ t)~cl~s(i~t-looT)+c2~s(i~t-2ooT) where ~(j,t) is
the unsmoothed input autocorrelation and l~ (j,t) is the smoothed
10 output autocorrelation for the j th value of time delay; t
denotes real time; and T denotes the time interval between
consecutively generated autocorrelation functions (equal toO.0001
second in the preferred embodimen~). The constants CO, Cl,
C2 are chosen to give the smoothing function an approximately
Gaussian impulse response with a frequency cutoff of approxi-
mately 20 Hz. As indicated, this smoothing function is
applied separately for each delay j. As indicated at 81,
a cosine Fourier transform is then applied to each
autocorrelation function so as to generate a 32-point power
spectrum. The spectrum is defined as

$$\tfrac{1}{2}\,\psi_s(0,t) + \sum_{j=1}^{31} \psi_s(j,t)\,\cos(2\pi F_0\, j K)$$

As will be understood, each point or value within each spectrum
represents a corresponding band of frequencies. While this
Fourier transform can be performed completely within the
conventional computer hardware, the process is speeded considerably
if an external hardware multiplier or Fast-Fourier-Transform
peripheral device is utilized. The construction and operation
of such modules are well known in the art, however, and are not
described in detail herein. After the cosine Fourier transform
has been applied, each of the resulting power spectra is
smoothed, at 83, by means of a Hamming window. As indicated,
these functions are performed on each block of data and the
program loops, as indicated at 85, until the overall word
interval, about one second, is completed.
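
By way of illustration only, the smoothing at 79 and the cosine transform at 81 may be sketched in Python as follows. The function names, the start-up state of the filter, and the numerical values of C0, C1 and C2 (a critically damped two-pole filter with unity gain at zero frequency) are assumptions made for this sketch and are not taken from the description. The band spacing of 1/63 cycle per delay step is likewise assumed, so that 32 bands run from zero to just under half the sampling frequency.

```python
import math


def two_pole_smooth(frames, c0, c1, c2):
    """Smooth each delay channel j across successive frames.

    frames[n][j] is the autocorrelation at delay j in the n-th frame
    read out to the computer (assumed to be spaced 100T apart).
    """
    prev1 = prev2 = list(frames[0])  # start-up state: an assumption
    smoothed = []
    for frame in frames:
        out = [c0 * x + c1 * y1 + c2 * y2
               for x, y1, y2 in zip(frame, prev1, prev2)]
        smoothed.append(out)
        prev1, prev2 = out, prev1
    return smoothed


def cosine_power_spectrum(acf, n_bands=32):
    """Cosine transform of one smoothed autocorrelation function."""
    n_lags = len(acf)
    f0 = 1.0 / (2 * n_lags - 1)  # assumed band spacing, cycles per lag
    return [0.5 * acf[0]
            + sum(acf[j] * math.cos(2 * math.pi * f0 * j * k)
                  for j in range(1, n_lags))
            for k in range(n_bands)]


# Hypothetical constants for a 20 Hz cutoff at 100 frames per second.
a = math.exp(-2 * math.pi * 20 / 100)
C0, C1, C2 = (1 - a) ** 2, 2 * a, -a * a
```

With these constants C0 + C1 + C2 = 1, so a steady input passes through unchanged, and an autocorrelation function that is a pure cosine at one band centre yields a spectrum peaked in that band.
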
As the successive short-term power spectra representing
the word interval are processed through the loop comprising
steps 79-85, a record is kept of the highest amplitude occurring
within each frequency band. Initially the peak amplitude occurring
in the entire word is searched out or detected, as indicated
at 87. Starting at the beginning of the word (Step 88) a loop
is then run, comprising steps 89-91 which detects the peak
occurring within each frequency band and these peak values are
stored. At the end of the word interval, the peak values define
a peak spectrum. The peak spectrum is then smoothed by averaging
each peak value with values corresponding to adjacent frequencies,
the width of the overall band of frequencies contributing to the
average value being approximately equal to the typical frequency
separation between formant frequencies. This step is indicated
at 93. As will be understood by those skilled in the speech
recognition art, this separation is in the order of 1000 Hz. By
averaging in this particular way, the useful information in the
spectra, that is, the local variation in formant frequencies, is
retained whereas overall or gross emphasis in the frequency
spectrum is suppressed. The overall peak amplitude, determined
at step 87, is then employed to restore the peak amplitude of
the smoothed peak spectrum to a level equal to the original peak
amplitude. This step is indicated at 94 and is employed to
allow maximum utilization of the dynamic range of the system.
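
The peak-spectrum steps just described (per-band peak detection, averaging across adjacent bands, and restoration of the overall peak level) may be sketched as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, the averaging window is truncated at the band edges, and the default half-width of 3 bands assumes a band spacing of roughly 159 Hz, so that seven bands cover about 1000 Hz.

```python
def smoothed_peak_spectrum(spectra, half_width=3):
    """Return the smoothed, rescaled peak spectrum for one word.

    spectra[n][f] is the amplitude in band f of the n-th short-term
    spectrum of the word interval.
    """
    n_bands = len(spectra[0])
    # Highest amplitude seen in each band over the word interval.
    peaks = [max(s[f] for s in spectra) for f in range(n_bands)]
    overall_peak = max(peaks)
    # Moving average over neighbouring bands, truncated at the edges.
    smoothed = []
    for f in range(n_bands):
        lo, hi = max(0, f - half_width), min(n_bands, f + half_width + 1)
        smoothed.append(sum(peaks[lo:hi]) / (hi - lo))
    # Restore the smoothed peak to the original overall peak level.
    top = max(smoothed)
    if top == 0:
        return smoothed
    return [v * overall_peak / top for v in smoothed]
```
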

After obtaining the smoothed peak amplitude spectrum,
the successive individual short-term spectra representing the
incoming audio signal are frequency compensated by dividing the
amplitude value for each frequency band within each short-term
spectrum by the corresponding value in the smoothed peak spectrum.
This step is indicated at 99, being part of a loop which
processes the entire word and which comprises steps 98-102.
This then generates a sequence of frequency band equalized
spectra which emphasize changes in the frequency content of the
incoming audio signal while suppressing any generalized frequency
emphasis or distortion. This method of frequency compensation
has been found to be highly advantageous in the recognition of
speech signals transmitted over telephone lines compared with
the more usual systems of frequency compensation in which the
basis for compensation is the average power level, either in the
whole signal or in each respective frequency band.
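
The division performed at 99 may be sketched as below. The function name and the small floor value, which merely guards against division by zero in an empty band, are assumptions introduced for this sketch.

```python
def equalize(spectra, peak_spectrum, floor=1e-12):
    """Divide every band of every short-term spectrum by the
    corresponding value of the smoothed peak spectrum."""
    return [[s / max(p, floor) for s, p in zip(spectrum, peak_spectrum)]
            for spectrum in spectra]
```

Because the divisor is derived from the peaks in each band rather than from average power, a band that is uniformly attenuated by the transmission path is restored to the same range as the others.
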
At this point, it is useful to point out that, while
the successive short-term power spectra have been variously
processed and equalized, the data representing the spoken word
still comprises in the order of 100 spectra, each spectrum having
been normalized and frequency compensated in such a way that
shifts in individual formant frequencies from one short-term
power spectrum to another are emphasized.
As in various prior art systems, the speech recognition
performed by the procedure of the present invention utilizes
the patterns and shifts in patterns of formant frequencies to
recognize words in its vocabulary. In order to permit the
recognition of pattern shifts even if speaking rate is varied,
the preferred embodiment of the system generates a parameter
which may be considered to be a measurement of subjective time.
In the present system, a value corresponding to this parameter
is generated relatively simply by accumulating or
summing the absolute values of the change in the amplitude
of each frequency band from one successive frequency band
equalized spectrum to the next and summing over all
the frequency bands as well. If the spectrum, valued over
32 frequency bands, is considered to be a vector in 32 dimensions,
the movement of the tip of this vector from one spectrum to
the next may be considered to be an increment of arc length.
Further, the sum of the changes in the various dimensions is
a sufficiently accurate representation of arc length for this
purpose. By accumulating the arc length increments over the
entire word interval, a cumulative arc length may be obtained.
Accordingly, when the speaker stretches out a phoneme in his
pronunciation, the accumulation of arc length will grow only
very slightly and yet will grow quickly when the speaking rate
is accelerated. The accumulation process is indicated at 101
in Fig. 3.
Preferably, the contributions from the different
frequency bands are weighted, prior to this latter summing, so
that the phonetically more significant frequencies exert a
greater effect. In other words, the magnitude of the amplitude
change, in each frequency band, between two consecutively
evaluated spectra is multiplied by a constant weighting factor
associated with that frequency band. The weighted magnitudes of
the changes are then summed over all the frequency bands to
yield the increment of subjective time elapsed between the two
spectra.
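
The accumulation of subjective time may be sketched as follows. The function name is hypothetical, and the weights are passed in by the caller; if none are given, the sketch falls back to the unweighted sum of absolute changes.

```python
def subjective_time(spectra, weights=None):
    """Cumulative weighted arc length over a sequence of spectra.

    spectra[n][f] is the equalized amplitude in band f of spectrum n.
    Returns one cumulative value per spectrum, starting at 0.0.
    """
    n_bands = len(spectra[0])
    if weights is None:
        weights = [1.0] * n_bands  # unweighted sum of absolute changes
    cumulative = [0.0]
    for prev, cur in zip(spectra, spectra[1:]):
        # Increment of subjective time between two consecutive spectra.
        increment = sum(w * abs(c - p)
                        for w, c, p in zip(weights, cur, prev))
        cumulative.append(cumulative[-1] + increment)
    return cumulative
```

For example, `subjective_time([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]], [1.0, 0.5])` gives `[0.0, 1.0, 2.0]`: the second band changes by 2 but carries half the weight.
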
Changes that occur in the frequency range normally
occupied by the lowest three formant resonances of the vocal
tract are found to be much more valuable in correcting for the
rate of articulation than changes at higher frequencies. In
fact, the relative contributions at frequencies above 2500 Hz
are so low that the weights in these frequency bands may be set
to zero with no statistically significant effect on the results.
A table of the weighting factors, optimized for the
preferred embodiment in a particular practical application of
the method, is presented below. The values given are not
intended to be restrictive, and in fact the optimum values may
depend on the particulars of the spectrum analysis method
employed, the vocabulary of words to be recognized, and the
sex and age of the talkers. These values do, however, represent
an effort to reach a best compromise for talker-independent
recognition of a general English vocabulary.

Table of weighting factors for subjective time calculation

Frequency band      Relative
Center, Hz          Weighting Factor

   0                0.254
 159                0.261
 317                0.736
 476                1.000
 635                0.637
 794                0.377
 952                0.240
1111                0.26~
1270                0.377
1429                0.470
1587                0.381
1746                0.254
1905                0.181
2063                0.079
2222                0.025
2381                0.002

When a value or parameter representing the total arc
length is obtained, it is then divided into 12 equal increments.
For each such increment one block of data representing a
representative frequency band equalized spectrum is selected, as
indicated at 105. Thus, the number of frequency band equalized
spectra required to represent the sample interval
is reduced by a factor of about eight. However, it should be
understood that, due to the so-called subjective time
evaluation, this is not equivalent to selecting one sample
for every eight spectra calculated. The original sampling rate
is constant with respect to absolute time but the selected
samples will be equally spaced with respect to subjective time,
i.e. as measured in accordance with the method described above.
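
One possible way of carrying out the selection is sketched below. The description does not state how the representative spectrum within each increment is chosen; the sketch assumes, purely for illustration, that it is the spectrum whose cumulative subjective time lies closest to the midpoint of the increment. The function name is hypothetical.

```python
from bisect import bisect_left


def select_frames(cumulative, n_select=12):
    """Pick one spectrum index per equal increment of subjective time.

    cumulative[n] is the non-decreasing cumulative arc length at
    spectrum n; the last entry is the total arc length.
    """
    total = cumulative[-1]
    picks = []
    for k in range(n_select):
        # Midpoint of the k-th increment (an assumed choice).
        target = (k + 0.5) * total / n_select
        i = bisect_left(cumulative, target)
        # Step back if the earlier neighbour is at least as close.
        if i > 0 and (i == len(cumulative)
                      or target - cumulative[i - 1] <= cumulative[i] - target):
            i -= 1
        picks.append(i)
    return picks
```

When the cumulative value grows uniformly, the picks are evenly spaced in real time; where a sound is held and the cumulative value barely grows, few or no spectra are taken from that stretch.
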
Either just prior to or just following the selection
process, the spectra are subjected to an amplitude transformation,
indicated at 107, which effects a non-linear scaling.
Assuming the individual spectra to be designated as S(f,t),
where f indexes the different frequency bands and t denotes
real time, the non-linearly scaled spectrum $S_1(f,t)$ is the
linear fraction function

$$S_1(f,t) = \frac{S(f,t) - A}{S(f,t) + A}$$

where A is the average value of the spectrum defined as follows:

$$A = \frac{1}{32} \sum_{f=1}^{32} S(f,t)$$

This scaling produces a soft threshold and gradual
saturation effect for spectral intensities which deviate
greatly from the short-term average A. For intensities nearer
the average, the function is approximately linear. Further
from the average, it is approximately logarithmic and at extreme
values it is nearly constant. On a logarithmic scale, the
function $S_1(f,t)$ is symmetric about zero and the function
exhibits threshold and saturation behavior that is suggestive
of an auditory nerve firing rate function. In practice, the
overall recognition system performs significantly better with
this particular non-linear scaling function than it does with
either a linear or a logarithmic scaling of the spectrum
amplitudes.
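
The scaling of one spectrum may be sketched as follows; the function name and the handling of an all-zero spectrum are assumptions of the sketch. The symmetry on a logarithmic scale can be seen from the identity (S - A)/(S + A) = tanh(0.5 ln(S/A)), which holds for any positive S and A.

```python
def soft_scale(spectrum):
    """Apply the linear fraction function (S - A) / (S + A), where A
    is the average of the spectrum over its frequency bands."""
    a = sum(spectrum) / len(spectrum)
    if a == 0:
        return [0.0] * len(spectrum)  # assumed handling of silence
    return [(s - a) / (s + a) for s in spectrum]
```

For a two-band spectrum `[1.0, 3.0]` the average is 2.0 and the scaled values are -1/3 and 1/5; every output lies between -1 and +1 for non-negative input.
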
A linear matrix operation next transforms each equalized
spectrum into a set of coefficients in which phonetic attributes
of the processed speech are enhanced. Symbolically, the
transformation applies coefficients $P_{ij}$ linearly to the spectrum
to obtain numerical values for a set of feature data $x_i$.

$$x_i(t) = \sum_{j=1}^{32} P_{ij}\, S(j,t) \qquad (1)$$

The coefficients are evaluated from a sample collection
of spoken word inputs to be recognized so that the average value
of $x_i$ is a minimum when the input signal is in the ith
predefined phonetic class, while $x_i$ is as large as possible if
the input belongs to a class other than the ith class. The
coefficients $P_{ij}$ which best satisfy one or the other of these
criteria can be evaluated by analyzing examples of known speech
input waveforms using well-known statistical techniques of
linear system theory, multidimensional scaling theory, and
factor analysis.
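
Equation (1) is an ordinary matrix-vector product, which may be sketched as below. The function name is hypothetical and the coefficient values in the usage note are invented for illustration only; actual coefficients would be obtained from training data as described above.

```python
def phonetic_features(p, spectrum):
    """Compute x_i = sum over j of P_ij * S(j) for each row i of p."""
    return [sum(p_ij * s_j for p_ij, s_j in zip(row, spectrum))
            for row in p]
```

For instance, with the illustrative matrix `[[1, 0], [1, -1]]` and spectrum `[3.0, 1.0]`, the feature values are `[3.0, 2.0]`.
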
For the purpose of evaluating the transformation
coefficients $P_{ij}$, a "phonetic class" is defined to contain
whatever sound occurs at one of the sequentially numbered
selected samples of a designated word of the vocabulary to
be recognized. Even though the same nominal phoneme may occur
in different words or in different syllables of the same word,
the acoustic properties of the sound become modified, often
substantially, by the surrounding phonetic context; hence the
phonetic classes employed here are context-specific.
It is possible to take advantage of this contextual
modification by having an increased number of linear transformation
coefficients $P_{ij}$ act simultaneously on two or more
consecutively selected spectra. This alternate procedure, while
more complex, differentiates syllables more reliably than the
phonetic transformation differentiates phonemes.
The selected, transformed data

$$x = \{\,x_i(t_k),\ i = 1, \ldots, 32;\ k = 1, \ldots, 12\,\} \qquad (2)$$

are finally applied as inputs to a statistical likelihood
calculation, indicated at 131. This processor computes a
measure of the probability that the unknown input speech matches
each of the reference words of the machine's vocabulary.
Typically, each datum $x_i(t_k)$ has a slightly skew probability
density, but nevertheless is well approximated statistically
by a normal distribution with mean value m(i,k) and variance
$[s(i,k)]^2$. The simplest implementation of the process assumes
that the data associated with different values of i and k are
uncorrelated, so that the joint probability density for all the
data x comprising a given spoken word input is (logarithmically)

$$\ln p(x) = -\sum_{i,k} \ln\!\left(\sqrt{2\pi}\, s(i,k)\right) - \frac{1}{2} \sum_{i,k} \left[\frac{x_i(t_k) - m(i,k)}{s(i,k)}\right]^2 \qquad (3)$$

which can be rewritten as

$$\ln p(x) = -\sum_{i,k} \ln\!\left(\sqrt{2\pi}\, s(i,k)\right) - \sum_{i,k} \frac{1}{2\,[s(i,k)]^2}\,\bigl(x_i(t_k) - m(i,k)\bigr)^2$$

or

$$\ln p(x) = c - \sum_{r} b_r\,(x_r - m_r)^2$$

where r is indexed over all i and k. Since the logarithm is a
monotonic function, this statistic is sufficient to determine
whether the probability of a match with one vocabulary word is
greater or less than the probability of a match with some other
vocabulary word. Each word in the vocabulary has its own set of
statistical reference parameters m(i,k), s(i,k). Each of these
sets of parameters is compared with the set of data until the
input speech has been tested against all the words of the
vocabulary. The resulting statistical table ranks the various
vocabulary choices in accordance with their relative likelihood
of occurrence.
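
The likelihood calculation under the simplest assumption (uncorrelated, normally distributed data) may be sketched as follows. The function names, the flattening of the indices i and k into a single list, and the two-word vocabulary in the usage note are assumptions made for illustration.

```python
import math


def log_likelihood(x, mean, sigma):
    """Log of the joint normal density for independent data.

    x, mean and sigma are flat lists indexed by r, where r runs over
    all pairs (i, k); sigma holds standard deviations s(i,k).
    """
    total = 0.0
    for x_r, m_r, s_r in zip(x, mean, sigma):
        total -= math.log(math.sqrt(2 * math.pi) * s_r)
        total -= 0.5 * ((x_r - m_r) / s_r) ** 2
    return total


def rank_vocabulary(x, references):
    """Rank vocabulary words by decreasing log-likelihood.

    references maps each word to its (mean, sigma) parameter lists.
    """
    scores = [(word, log_likelihood(x, m, s))
              for word, (m, s) in references.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

With hypothetical references `{"yes": ([0.0, 1.0], [1.0, 1.0]), "no": ([2.0, -1.0], [1.0, 1.0])}` and input `[0.0, 1.0]`, the word "yes" is ranked first because the input coincides with its mean values.
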
The determination of $P_{ij}$ and the set of coefficients
$(a_i, b_i, c)$ or the equivalent (m(i,k), s(i,k)) is well known in
the pattern recognition art as described in Atal, Automatic
Speaker Recognition Based on Pitch Contours, JOSA, 52,
pp. 1687-1697 (1972); and Klein et al, Vowel Spectra, Vowel
Spaces, and Vowel Identification, JOSA, 48, pp. 999-1009 (1970).
As will be understood by those skilled in the art,
this ranking constitutes the speech recognition insofar as it
can be performed from single word samples. This ranking can be
utilized in various ways in an overall system depending upon
the ultimate function to be performed. In certain systems,
e.g., telephonic data entry, a simple first choice trial and
error system may be entirely adequate. In others it may be
desired to employ contextual knowledge or conventions in order
to improve the accuracy of recognition of whole sentences. Such
modifications, however, go beyond the scope of the present
inventions and are not treated herein.
As indicated previously, a presently preferred
embodiment of the invention was constructed in which signal and
data manipulation, beyond that performed by the preprocessor
of Fig. 2, was implemented by a Digital Equipment Corporation
PDP 11 computer.
The detailed programs which provide the functions
described in relation to the flow chart of Fig. 3 do not form
part of the invention. It would be well within the skill of
one skilled in the programming arts to prepare an appropriate
instruction list to implement the functions described in the
flow chart of Fig. 3.
In view of the foregoing, it may be seen that several
objects of the present invention are achieved and other
advantageous results have been attained.
As various changes could be made in the above constructions
without departing from the scope of the invention,
it should be understood that all matter contained in the above
description or shown in the accompanying drawings shall be
interpreted as illustrative and not in a limiting sense.

Administrative Status

Title	Date
Forecasted Issue Date	1985-01-08
(22) Filed	1976-12-29
(45) Issued	1985-01-08
Expired	2002-01-08

Abandonment History

There is no abandonment history.

Payment History

Fee Type	Anniversary Year	Due Date	Amount Paid	Paid Date
Application Fee			$0.00	1976-12-29

Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
EXXON CORPORATION

Past Owners on Record
None

Documents

Document Description	Date (yyyy-mm-dd)	Number of pages	Size of Image (KB)
Drawings	1993-11-09	3	102
Claims	1993-11-09	8	339
Abstract	1993-11-09	1	30
Cover Page	1993-11-09	1	15
Description	1993-11-09	20	967
