Background of the Invention
The present invention relates to speech recognition
apparatus and more particularly to such apparatus in which
sequentially generated spectra are equalized and selected to
improve accuracy of recognition upon comparison with data
representing a vocabulary to be recognized.
Various speech recognition systems have been proposed
heretofore including those which attempt to recognize phonemes
and attempt to recognize and determine the pattern of behavior
of formant frequencies within speech. While these prior art
techniques have achieved various measures of success, substantial
problems exist. For example, the vocabularies which
can be recognized are limited; the recognition accuracy is
highly sensitive to differences between the voice characteristics
of different talkers; and the systems have been highly sensitive
to distortion in the speech signal being analyzed. This latter
problem has typically precluded the use of such automatic
speech recognition systems on speech signals transmitted over
ordinary telephone apparatus, even though such signals were
easily capable of being recognized and understood by a human
observer.
Among the objects of the present invention may be
noted the provision of speech recognition apparatus providing
improved accuracy of recognition; the provision of such
apparatus which is relatively insensitive to frequency
distortion of the speech signal to be recognized; the provision
of such a system which is relatively insensitive to variations
in speaking rate in the signal to be analyzed; the provision of
such a system which will respond to different voices; and the
provision of such apparatus which is of highly reliable and
relatively simple and inexpensive construction. Other objects
and features will be in part apparent and in part pointed out
hereinafter.
Summary of the Invention
The speech recognition system of the present invention
spectrum analyzes an audio signal to determine the behavior of
formant frequencies over an interval of time corresponding to
a spoken word or phrase. Repeatedly within the interval, a
short-term power spectrum is generated representing the amplitude
or power spectrum of the audio signal in a brief sub-interval.
For each frequency band in the short-term spectra,
the maximum value occurring over the interval is determined,
thereby obtaining a peak power spectrum over the interval.
This peak spectrum is smoothed by averaging each maximum value
with values corresponding to adjacent frequency bands, the
width of the overall band contributing to each average being
approximately equal to the typical frequency separation between
formant frequencies (about 1000 Hz). For each of the originally
obtained sequence of short-term power spectra, the amplitude
value of each frequency band is divided by the corresponding
value in the smoothed peak spectrum, thereby generating a
corresponding sequence of frequency-equalized spectra. Comparison
of a selected group of these frequency band equalized
spectra with a data base identifying a known vocabulary provides
improved recognition when the original speech signal has been
subject to frequency distortion, e.g. by a telephone line
transmission.
Brief Description of the Drawings
Fig. 1 is a flow chart illustrating the general
sequence of operations performed in accordance with the practice
of the present invention;
Fig. 2 is a schematic block diagram of electronic
apparatus performing certain initial operations in the overall
process illustrated in Fig. 1; and
Fig. 3 is a flow diagram of a digital computer program
performing certain subsequent procedures in the process of
Fig. 1.
Corresponding reference characters indicate corresponding
parts throughout the several views of the drawings.
Description of the Preferred Embodiment
In the particular preferred embodiment which is
described herein, speech recognition is performed by an overall
apparatus which involves both a specially constructed electronic
system for effecting certain analog and digital processing of
incoming speech signals and also a general-purpose digital
computer which is programmed in accordance with the present
invention to effect certain data reduction steps and numerical
evaluations. The division of tasks between the hardware portion
and the software portion of this system has been made so as to
obtain an overall system which can accomplish speech recognition
in real time at moderate cost. However, it should be understood
that some of the tasks being performed in hardware in this
particular system could as well be performed in software and
that some of the tasks being performed by software programming
in this example might also be performed by special-purpose
circuitry in a different embodiment of the invention.
The successive operations performed by the present
system in recognizing speech signals are illustrated in a
general way in Fig. 1. It is useful, in connection with this
initial description, to also follow through the various data
rates which are involved so as to facilitate an understanding of
the detailed operation described hereinafter. As indicated
previously, one aspect of the present invention is the provision
of apparatus which will recognize speech signals, even though
those signals are frequency distorted, e.g. as by a telephone
line. Thus, in Fig. 1, the voice input signal, indicated at
11, may be considered to be a voice signal received over a
telephone line encompassing any arbitrary distance or number of
switching interchanges.
As will become apparent in the course of the following
description, the present method and apparatus are concerned
with the recognition of speech segments containing a sequence
of sounds or phonemes. In the following description and in the
claims, reference is made to "an interval corresponding to a
spoken word" since this is a convenient way of expressing a
minimum length of time appropriate for encompassing a recognizable
sequence of sounds. This term, however, should be broadly and
generically construed so as to encompass a series of words in
the grammatical sense, as well as a single word.
In the particular implementation illustrated, the
interval corresponding to a spoken word is taken somewhat
arbitrarily as a one second interval. Various techniques are
known in the art for determining when to start or initiate this
interval. In general, the particular technique used forms no
part of the present invention. However, it is at present
preferred that the interval be initiated when the input
signal power, calculated as described hereinafter, exceeds a
preset threshold more than half the time in a sliding window
of about 30 consecutively generated spectra of the voice signal,
digitized as described hereinafter.
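The start rule just described can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the program of the embodiment: the function name `find_interval_start`, the reduction of each spectrum to a single power value, and the choice to report the first spectrum of the qualifying window as the start of the interval are all hypothetical.

```python
def find_interval_start(powers, threshold, window=30):
    """Return the index of the first spectrum of the earliest sliding window
    in which more than half of the power values exceed the threshold.

    powers    -- one signal-power value per generated spectrum (assumed input)
    threshold -- the preset power threshold
    window    -- number of consecutive spectra in the sliding window
    Returns None when no window qualifies.
    """
    above = [1 if p > threshold else 0 for p in powers]
    if len(above) < window:
        return None
    count = sum(above[:window])          # spectra above threshold in window 0
    for start in range(len(above) - window + 1):
        if start > 0:
            # Slide the window by one spectrum: add the newcomer, drop the oldest.
            count += above[start + window - 1] - above[start - 1]
        if 2 * count > window:           # "more than half the time"
            return start
    return None
```

With 50 quiet spectra followed by 50 loud ones, the first window containing at least 16 loud spectra out of 30 begins at index 36.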
After being amplitude normalized by an analog a.g.c.
circuit, the voice signal is digitized, that is, the signal
amplitude is converted to digital form. In the present example,
an 8-bit binary representation of the signal amplitude is
generated at a rate of 10,000 conversions per second. An
autocorrelator 17 processes this input signal to generate an
autocorrelation function 100 times per second, as indicated at
19. Each autocorrelation function comprises 32 values or
channels, each value being calculated to a 24-bit resolution.
The autocorrelator is described in greater detail hereinafter
with reference to Fig. 2.
The autocorrelation functions 19 are subjected to
Fourier transformation, as indicated at 21, so as to obtain
corresponding power spectra 23. These "original" spectra are
calculated at the same repetition rate as the autocorrelation
functions, i.e. 100 per second, each spectrum comprising
32 channels each having a resolution of 16 bits.
As will be understood, each of the 32 channels in each spectrum
represents a frequency band. In the present embodiment, the
Fourier transformation, as well as subsequent processing steps,
are performed under the control of a general-purpose digital
computer, appropriately programmed, utilizing a peripheral
array processor for speeding the arithmetic operations required
repeatedly in the present method. The particular computer
employed is a model PDP 11* manufactured by the Digital Equipment
Corporation of Maynard, Massachusetts, and the programming
described hereinafter with reference to Fig. 3 is substantially
predicated upon the capabilities and characteristics of that
commercially available computer.
Each of the successive short-term power spectra is
frequency band equalized, as indicated at 25, this equalization
being performed as a function of the peak amplitude occurring in
each frequency band over the interval as described in greater
detail hereinafter. Again, the equalized spectra, designated 26,
are generated at the rate of 100 per second, each frequency band
equalized spectrum having 32 channels evaluated to 16-bit binary
accuracy.
* Trade Mark
In order to compensate for differences in speaking
rate, the system then performs a redistribution or compensation
predicated on the passage of subjective time. While this
compensation is described in greater detail hereinafter, it may
for the present be noted that this evaluation consists essentially
of the accumulation of the magnitude of all amplitude
changes in all of the different frequency channels over the
interval of interest. This accumulation is performed at 29.
Since the recognition of speech to some extent depends upon the
way in which formant frequencies shift, it can be seen that the
rate of shift is indicative of the speaking rate. Further, such
shifts will be reflected in changes in the amplitudes in the
frequency channels involved.
The subjective time evaluation provides a basis for
selection of a limited number of the frequency band equalized
spectra within the interval, which selected samples are fairly
representative of the spoken word. As indicated previously, the
short-term power spectra themselves are generated at the rate of
100 per second. As will be understood, however, much of the
data is redundant. In the practice of the present invention, it
has been found that 12 of the frequency band equalized spectra
provide an adequate representation of a short word or sequence
of phonemes, appropriate for recognition purposes. The subjective
time evaluation is therefore employed to divide the entire
interval (approximately one second) into 12 periods of equal
subjective time value and to select a corresponding short-term
power spectrum for each such period, the selection being performed
at 31.
In order to facilitate the final evaluation of the
spoken word, the amplitude values of the selected spectra are
subjected to a non-linear scalar transformation, as indicated
at 35. This transformation is described in greater detail
hereinafter but it may be noted at this point that this transformation
improves the accuracy with which an unknown speech
signal may be matched with a reference vocabulary. In the
embodiment illustrated, this transformation is performed on all
of the frequency band equalized spectra, in parallel with the
accumulation which evaluates subjective time and prior to the
selection of representative samples. The actual comparison
of the selected spectra with the data base is performed after a
vector transformation, indicated at 37, the product of the vector
transformation being applied to a likelihood evaluator indicated
at 41.
Preprocessor
In the apparatus illustrated in Fig. 2, an autocorrelation
function and an averaging function are performed digitally
on a data stream generated by the analog-to-digital converter
13 which digitizes the analog voice signal 11. The digital
processing functions, as well as the input analog-to-digital
conversion, are timed under the control of a clock oscillator
51. Clock oscillator 51 provides a basic timing signal at
320,000 pulses per second and this signal is applied to a
frequency divider 52 so as to obtain a second timing signal
at 10,000 pulses per second. This slower timing signal controls
the analog-to-digital converter 13 together with a latch 53
which holds the 8-bit results of the last conversion until the
next conversion is completed. Prior to being applied to the
latch, the digital value is converted to
a sign magnitude representation, as indicated at 54, from the
usual representation provided by conventional analog-to-digital
converters, such as that indicated at 13.
The autocorrelation products desired are generated by
a digital multiplier 56 together with a 32 word shift register
58 and appropriate control circuitry. The shift register 58
is operated in a recirculating mode and is driven by the faster
clock frequency so that one complete circulation of data is
accomplished for each analog-to-digital conversion. One input
to the digital multiplier 56 is taken from the latch 53 while
the other input to the multiplier is taken from the current
output of the shift register, the multiplications being performed
at the higher clock frequency. Thus, each value obtained
from the conversion is multiplied with each of the preceding
31 conversion values. As will be understood by those skilled
in the art, the signals thereby generated are equivalent to
multiplying the input signal by itself delayed in time by 32
different time increments. To produce the zero-delay
correlation (i.e. the power), a multiplexer 59 causes the current
value to be multiplied by itself at the time each new value
is being introduced into the shift register, this timing
function being indicated at 60.
As will also be understood by those skilled in the art,
the products from a single conversion together with its 31
predecessors will not be fairly representative of the energy
distribution or spectrum of the signal over any reasonable
sampling interval. Accordingly, the apparatus of Fig. 2 provides
for the averaging of these sets of products.
To facilitate the additive process of averaging, the
sign/magnitude binary representation of the individual autocorrelation
products generated by multiplier 56 is converted to
a two's-complement code as indicated at 61. The accumulation
process which effects averaging is provided by a 32-word shift
register 63 which is interconnected with an adder 65 so as to
form a set of 32 accumulators. Thus, each word can be recirculated
after having added to it the corresponding increment from the
digital multiplier. The circulation loop passes through a
gate 67 which is controlled by a divider circuit 69 driven by
the lower frequency clock signal. The divider 69 divides the
lower frequency clock signal by a factor which determines the
number of instantaneous autocorrelation functions which are to
be accumulated or averaged before the shift register 63 is read
out.
In the preferred example, it is assumed that 100
samples are accumulated before being read out. In other words,
N for the divide-by-N divider is one hundred. After 100 samples
have thus been transformed and accumulated, the timing circuit
69 triggers a computer interrupt circuit 71. At this time,
the contents of the shift register 63 are read into the
computer's memory through suitable interface circuitry 73, the
32 successive words in the register being presented successively
to the interface. As will be understood by those skilled in the
art, this reading in of data may be typically performed by a
direct memory access procedure. Predicated on the averaging
of 100 samples, and an initial sampling rate of 10,000 per
second, it will be seen that 100 averaged autocorrelation
functions will be provided to the computer every second. While
the shift register contents are being read out to the computer,
the gate 67 is closed so that each of the words in the shift
register is effectively reset back to zero to permit the accumulation
to begin again.
Expressed in mathematical terms, the operation of the
apparatus shown in Fig. 2 may be described as follows. Assuming
the analog-to-digital converter generates the time series S(t),
S(t-T), S(t-2T), ..., the digital correlator circuitry of Fig. 2
may be considered to compute the autocorrelation function

$$\psi(j,t) = \sum_{k=1}^{100} S(t-kT)\,S(t-[k-j]T)$$
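The accumulation performed by the correlator can be mimicked in software roughly as follows. This is a hedged Python sketch that follows the hardware description (each new sample multiplied by itself and by its 31 predecessors, with the products summed over 100 samples); the function name `autocorrelation_blocks` and the assumption that the shift register starts out holding zeros are illustrative choices, not details taken from the apparatus.

```python
def autocorrelation_blocks(samples, num_lags=32, block=100):
    """Accumulate lagged products of a sample stream, one block per `block` samples.

    history[j] plays the role of the recirculating shift register: after a new
    sample is shifted in, history[0] is the current sample and history[j] is
    the sample taken j conversions earlier (zero before any data has arrived,
    which is an assumption of this sketch).
    """
    history = [0] * num_lags
    acc = [0] * num_lags                 # the 32 accumulators
    blocks = []
    for n, s in enumerate(samples, start=1):
        history = [s] + history[:-1]     # shift the new sample in
        for j in range(num_lags):
            acc[j] += s * history[j]     # lag 0 is the power term
        if n % block == 0:
            blocks.append(acc)           # read out to the computer
            acc = [0] * num_lags         # accumulators reset to zero
    return blocks
```

At 10,000 samples per second and 100 samples per block, one second of input yields 100 blocks of 32 values, matching the data rate stated in the text.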
After an interval corresponding to a spoken word, the
digital correlator will have transferred to the computer a
series of data blocks representing the spoken word. Assuming
that the interval of interest is in the order of one second,
there will be 100 blocks of data, each comprising 32 words of
24 bits each. Further, each block of data represents an
autocorrelation function derived from a corresponding sub-interval
of the overall interval under consideration. In the
embodiment illustrated, the processing of this data from this
point on in the system is performed by a general-purpose digital
computer, appropriately programmed. The flow chart which
includes the function provided by the computer program is given
in Fig. 3. Again, however, it should be pointed out that
various of these steps could also be performed by hardware rather
than software and that, likewise, certain of the functions
performed by the apparatus of Fig. 2 could additionally be
performed in the software by corresponding revision of the
flow chart of Fig. 3.
Although the digital correlator of Fig. 2 performs some
time averaging of the autocorrelation functions generated on
an instantaneous basis, the averaged autocorrelation functions
read out to the computer may still contain some anomalous
discontinuities or unevennesses which might interfere with
orderly processing and evaluation of the samples. Accordingly,
each block of data is first smoothed with respect to time, i.e.
with respect to adjacent channels defining the function, which
channels correspond to successive delay periods. This is
indicated in the flow chart of Fig. 3 at 79. The preferred
smoothing process is a two-pole convolutional procedure in
which the smoothed output $\psi_s(j,t)$ is given by

$$\psi_s(j,t) = C_0\,\psi(j,t) + C_1\,\psi_s(j,t-100T) + C_2\,\psi_s(j,t-200T)$$

where $\psi(j,t)$ is the unsmoothed input autocorrelation and
$\psi_s(j,t)$ is the smoothed output autocorrelation for the
j th value of time delay; t denotes real time; and T denotes
the sampling interval (equal to 0.0001 second in the preferred
embodiment), 100T being the time interval between consecutively
generated autocorrelation functions. The constants $C_0$, $C_1$,
$C_2$ are chosen to give the smoothing function an approximately
Gaussian impulse response with a frequency cutoff of approximately
20 Hz. As indicated, this smoothing function is
applied separately for each delay j. As indicated at 81,
a cosine Fourier transform is then applied to each autocorrelation
function so as to generate a 32-point power spectrum.
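The spectrum is defined as

$$S(f,t) = \psi_s(0,t) + 2\sum_{j=1}^{31} \psi_s(j,t)\,\cos(2\pi f j T)$$

where $\psi_s(j,t)$ is the smoothed autocorrelation at the j th delay,
f is the center frequency of the band and T is the sampling interval.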
As will be understood, each point or value within each spectrum
represents a corresponding band of frequencies. While this
Fourier transform can be performed completely within the conventional
computer hardware, the process is speeded considerably
if an external hardware multiplier or Fast-Fourier-Transform
peripheral device is utilized. The construction and operation
of such modules are well known in the art, however, and are not
described in detail herein. After the cosine Fourier transform
has been applied, each of the resulting power spectra is
smoothed, at 83, by means of a Hamming window. As indicated,
these functions are performed on each block of data and the
program loops, as indicated at 85, until the overall word
interval, about one second, is completed.
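Steps 79 and 81 of the loop can be sketched in Python as below. Several things here are assumptions rather than details from the text: the filter constants are left as parameters because the description gives no numerical values for them, the smoothed history is taken to be zero before the first block, the transform is written in the standard form for an even autocorrelation sequence, and the default band spacing of 10,000/63 Hz is inferred from the band centers listed in the weighting table later in the description. The Hamming smoothing of step 83 is omitted.

```python
import math

def smooth_in_time(blocks, c0, c1, c2):
    """Two-pole recursive smoothing applied separately to each delay channel:
    y[n] = c0*x[n] + c1*y[n-1] + c2*y[n-2], where n counts successive blocks.
    Earlier outputs are assumed to be zero before the first block."""
    smoothed = []
    for n, block in enumerate(blocks):
        prev1 = smoothed[n - 1] if n >= 1 else [0.0] * len(block)
        prev2 = smoothed[n - 2] if n >= 2 else [0.0] * len(block)
        smoothed.append([c0 * x + c1 * p1 + c2 * p2
                         for x, p1, p2 in zip(block, prev1, prev2)])
    return smoothed

def power_spectrum(acf, sample_rate=10000.0, band_spacing=10000.0 / 63):
    """Cosine transform of one autocorrelation function into one value per band.
    Band k is centered at k * band_spacing Hz (an inferred, hypothetical grid)."""
    spectrum = []
    for k in range(len(acf)):
        f = k * band_spacing
        total = acf[0]
        for j in range(1, len(acf)):
            total += 2.0 * acf[j] * math.cos(2.0 * math.pi * f * j / sample_rate)
        spectrum.append(total)
    return spectrum
```

As a sanity check on the sketch, an autocorrelation that is a pure cosine at the center of band 3 produces energy in band 3 only, and a filter with the purely illustrative constants 0.25, 1.0 and -0.25 (unit gain at zero frequency) settles to the value of a constant input.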
As the successive short-term power spectra representing
the word interval are processed through the loop comprising
steps 79-85, a record is kept of the highest amplitude occurring
within each frequency band. Initially the peak amplitude occurring
in the entire word is searched out or detected, as indicated
at 87. Starting at the beginning of the word (Step 88) a loop
is then run, comprising steps 89-91 which detects the peak
occurring within each frequency band and these peak values are
stored. At the end of the word interval, the peak values define
a peak spectrum. The peak spectrum is then smoothed by averaging
each peak value with values corresponding to adjacent frequencies,
the width of the overall band of frequencies contributing to the
average value being approximately equal to the typical frequency
separation between formant frequencies. This step is indicated
at 93. As will be understood by those skilled in the speech
recognition art, this separation is in the order of 1000 Hz. By
averaging in this particular way, the useful information in the
spectra, that is, the local variation in formant frequencies, is
retained whereas overall or gross emphasis in the frequency
spectrum is suppressed. The overall peak amplitude, determined
at step 87, is then employed to restore the peak amplitude of
the smoothed peak spectrum to a level equal to the original peak
amplitude. This step is indicated at 94 and is employed to
allow maximum utilization of the dynamic range of the system.
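The peak detection, smoothing and level restoration of steps 87 to 94 might be sketched as follows. This Python fragment is illustrative only: the name `smoothed_peak_spectrum` is hypothetical, the averaging width of seven bands (`half_width=3`) is an assumption chosen so that the averaged band is roughly 1000 Hz wide if the bands are about 159 Hz apart, and averaging over fewer bands at the edges of the spectrum is a further assumption.

```python
def smoothed_peak_spectrum(spectra, half_width=3):
    """Build the smoothed peak spectrum for one word interval.

    spectra    -- list of short-term power spectra (equal-length lists)
    half_width -- bands on each side included in every average (assumed)
    """
    num_bands = len(spectra[0])
    # Peak spectrum: highest amplitude seen in each band over the interval.
    peak = [max(s[f] for s in spectra) for f in range(num_bands)]
    overall_peak = max(peak)
    # Average each peak value with its neighbouring bands.
    smoothed = []
    for f in range(num_bands):
        lo = max(0, f - half_width)
        hi = min(num_bands, f + half_width + 1)
        smoothed.append(sum(peak[lo:hi]) / (hi - lo))
    # Restore the overall peak level so the dynamic range is fully used.
    top = max(smoothed)
    if top == 0:
        return smoothed
    scale = overall_peak / top
    return [v * scale for v in smoothed]
```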
After obtaining the smoothed peak amplitude spectrum,
the successive individual short-term spectra representing the
incoming audio signal are frequency compensated by dividing the
amplitude value for each frequency band within each short-term
spectrum by the corresponding value in the smoothed peak spectrum.
This step is indicated at 99, being part of a loop which
processes the entire word and which comprises steps 98-102.
This then generates a sequence of frequency band equalized
spectra which emphasize changes in the frequency content of the
incoming audio signal while suppressing any generalized frequency
emphasis or distortion. This method of frequency compensation
has been found to be highly advantageous in the recognition of
speech signals transmitted over telephone lines compared with
the more usual systems of frequency compensation in which the
basis for compensation is the average power level, either in the
whole signal or in each respective frequency band.
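The division step itself is simple enough to show directly. In the minimal Python sketch below, the function name `equalize` and the small `floor` that guards against division by zero are assumptions added for illustration; the text does not say how an empty band would be handled.

```python
def equalize(spectra, smoothed_peak, floor=1e-12):
    """Divide every band of every short-term spectrum by the corresponding
    value of the smoothed peak spectrum.

    floor is a hypothetical safeguard for bands whose peak value is zero.
    """
    return [[value / max(peak, floor)
             for value, peak in zip(spectrum, smoothed_peak)]
            for spectrum in spectra]
```

Because every spectrum in the word is divided by the same per-band values, a fixed gain applied to one band by the transmission path scales both numerator and denominator in that band and so largely cancels, which is the effect the paragraph above describes.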
At this point, it is useful to point out that, while
the successive short-term power spectra have been variously
processed and equalized, the data representing the spoken word
still comprises in the order of 100 spectra, each spectrum having
been normalized and frequency compensated in such a way that
shifts in individual formant frequencies from one short-term
power spectrum to another are emphasized.
As in various prior art systems, the speech recognition
performed by the procedure of the present invention utilizes
the patterns and shifts in patterns of formant frequencies to
recognize words in its vocabulary. In order to permit the
recognition of pattern shifts even if speaking rate is varied,
the preferred embodiment of the system generates a parameter
which may be considered to be a measurement of subjective time.
In the present system, a value corresponding to this parameter
is generated relatively simply by accumulating or
summing the absolute values of the change in the amplitude
of each frequency band from one successive frequency band
equalized spectrum to the next and summing over all
the frequency bands as well. If the spectrum, valued over
32 frequency bands, is considered to be a vector in 32 dimensions,
the movement of the tip of this vector from one spectrum to
the next may be considered to be an increment of arc length.
Further, the sum of the changes in the various dimensions is
a sufficiently accurate representation of arc length for this
purpose. By accumulating the arc length increments over the
entire word interval, a cumulative arc length may be obtained.
Accordingly, when the speaker stretches out a phoneme in his
pronunciation, the accumulation of arc length will grow only
very slightly and yet will grow quickly when the speaking rate
is accelerated. The accumulation process is indicated at 101
in Fig. 3.
Preferably, the contributions from the different
frequency bands are weighted, prior to this latter summing, so
that the phonetically more significant frequencies exert a
greater effect. In other words, the magnitude of the amplitude
change, in each frequency band, between two consecutively
evaluated spectra is multiplied by a constant weighting factor
associated with that frequency band. The weighted magnitudes of
the changes are then summed over all the frequency bands to
yield the increment of subjective time elapsed between the two
spectra.
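The weighted accumulation of arc length can be illustrated with a few lines of Python. The function name `subjective_time` and the convention that the running total starts at zero for the first spectrum are assumptions of this sketch; the per-band weights are supplied by the caller.

```python
def subjective_time(spectra, weights):
    """Return the cumulative subjective time at each spectrum.

    The increment between two consecutive spectra is the weighted sum, over
    all frequency bands, of the absolute change in amplitude.
    """
    cumulative = [0.0]
    for previous, current in zip(spectra, spectra[1:]):
        increment = sum(w * abs(c - p)
                        for w, c, p in zip(weights, current, previous))
        cumulative.append(cumulative[-1] + increment)
    return cumulative
```

A stretch of unchanging spectra, such as a sustained vowel, adds nothing to the total, while a rapid transition adds a large increment.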
Changes that occur in the frequency range normally
occupied by the lowest three formant resonances of the vocal
tract are found to be much more valuable in correcting for the
rate of articulation than changes at higher frequencies. In
fact, the relative contributions at frequencies above 2500 Hz
are so low that the weights in these frequency bands may be set
to zero with no statistically significant effect on the results.
A table of the weighting factors, optimized for the
preferred embodiment in a particular practical application of
the method, is presented below. The values given are not
intended to be restrictive, and in fact the optimum values may
depend on the particulars of the spectrum analysis method
employed, the vocabulary of words to be recognized, and the
sex and age of the talkers. These values do, however, represent
an effort to reach a best compromise for talker-independent
recognition of a general English vocabulary.

Table of weighting factors for subjective time calculation

Frequency band     Relative
Center, Hz         Weighting Factor
   0               0.254
 159               0.261
 317               0.736
 476               1.000
 635               0.637
 794               0.377
 952               0.240
1111               0.26~
1270               0.377
1429               0.470
1587               0.381
1746               0.254
1905               0.181
2063               0.079
2222               0.025
2381               0.002
When a value or parameter representing the total arc
length is obtained, it is then divided into 12 equal increments.
For each such increment one block of data representing a
representative frequency band equalized spectrum is selected, as
indicated at 105. Thus, the number of frequency band equalized
spectra required to represent the sample interval
is reduced by a factor of about eight. However, it should be
understood that, due to the so called subjective time
evaluation, this is not equivalent to selecting one sample
for every eight spectra calculated. The original sampling rate
is constant with respect to absolute time but the selected
samples will be equally spaced with respect to subjective time,
i.e. as measured in accordance with the method described above.
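One way to carry out this selection is sketched below in Python. The text does not say which spectrum within each increment is taken as representative, so the rule used here (the first spectrum whose cumulative subjective time reaches the midpoint of the increment) is an assumption, as is the function name `select_spectra`.

```python
def select_spectra(cumulative, count=12):
    """Pick `count` spectrum indices equally spaced in subjective time.

    cumulative -- non-decreasing cumulative subjective time, one value per
                  spectrum, for example the running weighted arc length
    Returns a list of indices into the sequence of spectra.
    """
    total = cumulative[-1]
    selected = []
    index = 0
    for k in range(count):
        # Midpoint of the k-th of `count` equal increments (assumed rule).
        target = (k + 0.5) * total / count
        while index < len(cumulative) - 1 and cumulative[index] < target:
            index += 1
        selected.append(index)
    return selected
```

When the cumulative value grows uniformly the chosen indices are evenly spaced in real time; when the first half of the interval is a held sound with no spectral change, every selected index falls in the second half.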
Either just prior to or just following the selection
process, the spectra are subjected to an amplitude transformation,
indicated at 107, which effects a non-linear scaling.
Assuming the individual spectra to be designated as S(f,t),
where f indexes the different frequency bands and t denotes
real time, the non-linearly scaled spectrum $S_1(f,t)$ is the
linear fraction function

$$S_1(f,t) = \frac{S(f,t) - A}{S(f,t) + A}$$

where A is the average value of the spectrum defined as follows:

$$A = \frac{1}{32}\sum_{f=1}^{32} S(f,t)$$
This scaling produces a soft threshold and gradual
saturation effect for spectral intensities which deviate
greatly from the short-term average A. For intensities near
the average, the function is approximately linear. Further
from the average, it is approximately logarithmic and at extreme
values it is nearly constant. On a logarithmic scale, the
function $S_1(f,t)$ is symmetric about zero and the function
exhibits threshold and saturation behavior that is suggestive
of an auditory nerve firing rate function. In practice, the
overall recognition system performs significantly better with
this particular non-linear scaling function than it does with
either a linear or a logarithmic scaling of the spectrum
amplitudes.
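The behavior described here is easy to verify numerically. The short Python sketch below applies the linear fraction function to one spectrum; the function names and the choice to return zeros for an all-zero spectrum are assumptions made for illustration.

```python
def scale_value(s, a):
    """Linear fraction function (s - a) / (s + a) for one band."""
    return (s - a) / (s + a)

def nonlinear_scale(spectrum):
    """Scale one non-negative spectrum about its own average value A."""
    a = sum(spectrum) / len(spectrum)
    if a == 0:
        # An all-zero spectrum has no defined ratio; returning zeros is an
        # assumption of this sketch.
        return [0.0] * len(spectrum)
    return [scale_value(s, a) for s in spectrum]
```

A band at the average maps to zero, a band at twice the average and a band at half the average map to values of equal size and opposite sign (the symmetry on a logarithmic scale), and very large or very small values approach +1 and -1 respectively.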
A linear matrix operation next transforms each equalized
spectrum into a set of coefficients in which phonetic attributes
of the processed speech are enhanced. Symbolically, the
transformation applies coefficients $P_{ij}$ linearly to the spectrum
to obtain numerical values for a set of feature data $x_i$.

$$x_i(t) = \sum_{j=1}^{32} P_{ij}\,S(j,t) \qquad (1)$$
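Equation (1) is an ordinary matrix-vector product, as the following Python sketch shows. The function name `phonetic_transform` is hypothetical and the small matrix in the usage note is invented purely to show the arithmetic; real coefficients would have to be estimated from sample speech as the description goes on to explain.

```python
def phonetic_transform(p, spectrum):
    """Apply the coefficient matrix p (a list of rows) to one spectrum.

    Element i of the result is the sum over j of p[i][j] * spectrum[j].
    """
    return [sum(p_ij * s_j for p_ij, s_j in zip(row, spectrum))
            for row in p]
```

For example, with the made-up matrix `[[1, 0], [1, -1]]` and the two-band spectrum `[3, 5]`, the first feature copies band 1 and the second is the difference of the two bands.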
The coefficients are evaluated from a sample collection
of spoken word inputs to be recognized so that the average value
of $x_i$ is a minimum when the input signal is in the ith predefined
phonetic class, while $x_i$ is as large as possible if
the input belongs to a class other than the ith class. The
coefficients $P_{ij}$ which best satisfy one or the other of these
criteria can be evaluated by analyzing examples of known speech
input waveforms using well-known statistical techniques of
linear system theory, multidimensional scaling theory, and
factor analysis.
For the purpose of evaluating the transformation
coefficients $P_{ij}$, a "phonetic class" is defined to contain
whatever sound occurs at one of the sequentially numbered
selected samples of a designated word of the vocabulary to
be recognized. Even though the same nominal phoneme may occur
in different words or in different syllables of the same word,
the acoustic properties of the sound become modified, often
substantially, by the surrounding phonetic context; hence the
phonetic classes employed here are context-specific.
It is possible to take advantage of this contextual
modification by having an increased number of linear transformation
coefficients $P_{ij}$ act simultaneously on two or more
consecutively selected spectra. This alternate procedure, while
more complex, differentiates syllables more reliably than the
phonetic transformation differentiates phonemes.
The selected, transformed data

$$x = \{x_i(t_k),\ i = 1,\ldots,32;\ k = 1,\ldots,12\} \qquad (2)$$

are finally applied as inputs to a statistical likelihood
calculation, indicated at 131. This processor computes a
measure of the probability that the unknown input speech matches
each of the reference words of the machine's vocabulary.
Typically, each datum $x_i(t_k)$ has a slightly skew probability
density, but nevertheless is well approximated statistically
by a normal distribution with mean value m(i,k) and variance
$[s(i,k)]^2$. The simplest implementation of the process assumes
that the data associated with different values of i and k are
uncorrelated, so that the joint probability density for all the
data x comprising a given spoken word input is (logarithmically)

$$\ln p(x) = -\sum_{i,k} \ln\left[\sqrt{2\pi}\,s(i,k)\right] - \frac{1}{2}\sum_{i,k}\left[\frac{x_i(t_k) - m(i,k)}{s(i,k)}\right]^2 \qquad (3)$$

which can be rewritten as:

$$\ln p(x) = -\sum_{i,k} \ln\left[\sqrt{2\pi}\,s(i,k)\right] - \sum_{i,k} \frac{1}{2\,s(i,k)^2}\left(x_i(t_k) - m(i,k)\right)^2$$

or

$$\ln p(x) = c - \sum_{r} b_r\,(x_r - m_r)^2$$

where r is indexed over all i and k. Since the logarithm is a
monotonic function, this statistic is sufficient to determine
whether the probability of a match with one vocabulary word is
greater or less than the probability of a match with some other
vocabulary word. Each word in the vocabulary has its own set of
statistical reference parameters m(i,k), s(i,k). Each of these
sets of parameters is compared with the set of data until the
input speech has been tested against all the words of the
vocabulary. The resulting statistical table ranks the various
vocabulary choices in accordance with their relative likelihood
of occurrence.
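The likelihood calculation and ranking can be sketched in Python as follows. This is a minimal illustration under the independence assumption stated above; the function names, the flattening of the (i, k) indices into a single list, and the dictionary of reference parameters are all hypothetical conveniences, not a description of the actual program.

```python
import math

def log_likelihood(x, mean, std):
    """Log of the joint normal density of uncorrelated data.

    x, mean and std are flat lists indexed by r, which runs over all (i, k).
    """
    total = 0.0
    for x_r, m_r, s_r in zip(x, mean, std):
        total += -math.log(math.sqrt(2.0 * math.pi) * s_r)
        total += -0.5 * ((x_r - m_r) / s_r) ** 2
    return total

def rank_vocabulary(x, references):
    """Rank vocabulary words from most to least likely.

    references maps each word to its (mean, std) reference parameters.
    """
    scores = {word: log_likelihood(x, mean, std)
              for word, (mean, std) in references.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With two made-up reference words whose means are `[0, 0]` and `[3, 3]` and unit standard deviations, an input of `[0.1, -0.2]` is ranked closest to the first word.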
The determination of $P_{ij}$ and the set of coefficients
(ai, bi, c) or the equivalent (m(i,k), s(i,k)) is well known in
the pattern recognition art as described in Atal, Automatic
Speaker Recognition Based on Pitch Contours, JOSA, 52, pp. 1687-1697
(1972); and Klein et al, Vowel Spectra, Vowel Spaces, and
Vowel Identification, JOSA, 48, pp. 999-1009 (1970).
As will be understood by those skilled in the art,
this ranking constitutes the speech recognition insofar as it
can be performed from single word samples. This ranking can be
utilized in various ways in an overall system depending upon
the ultimate function to be performed. In certain systems,
e.g., telephonic data entry, a simple first choice trial and
error system may be entirely adequate. In others it may be
desired to employ contextual knowledge or conventions in order
to improve the accuracy of recognition of whole sentences. Such
modifications, however, go beyond the scope of the present
invention and are not treated herein.
As indicated previously, a presently preferred
embodiment of the invention was constructed in which signal and
data manipulation, beyond that performed by the preprocessor
of Fig. 2, was implemented by a Digital Equipment Corporation
PDP 11 computer.
The detailed programs which provide the functions
described in relation to the flow chart of Fig. 3 do not form
part of the invention. It would be well within the skill of
one skilled in the programming arts to prepare an appropriate
instruction list to implement the functions described in the
flow chart of Fig. 3.
In view of the foregoing, it may be seen that several
objects of the present invention are achieved and other
advantageous results have been attained.
As various changes could be made in the above constructions
without departing from the scope of the invention,
it should be understood that all matter contained in the above
description or shown in the accompanying drawings shall be
interpreted as illustrative and not in a limiting sense.