Language selection

Search

Patent 1229921 Summary

Third-party information liability

Some of the information on this Web page has been provided by external sources. The Government of Canada is not responsible for the accuracy, reliability or currency of the information supplied by external sources. Users wishing to rely upon this information should consult directly with the source of the information. Content provided by external sources is not subject to official languages, privacy and accessibility requirements.

Claims and Abstract availability

Any discrepancies in the text and image of the Claims and Abstract are due to differing posting times. Text of the Claims and Abstract are posted:

  • At the time the application is open to public inspection;
  • At the time of issue of the patent (grant).
(12) Patent: (11) CA 1229921
(21) Application Number: 1229921
(54) English Title: SPEECH RECOGNITION METHOD HAVING NOISE IMMUNITY
(54) French Title: METHODE DE RECONNAISSANCE DE LA PAROLE INSENSIBLE AU BRUIT
Status: Term Expired - Post Grant
Bibliographic Data
(51) International Patent Classification (IPC):
  • G10L 15/06 (2013.01)
  • G10L 15/10 (2006.01)
  • G10L 25/78 (2013.01)
(72) Inventors :
  • KLOVSTAD, JOHN W. (United States of America)
  • LEE, CHIN-HUI (United States of America)
  • GANESAN, KALYAN (United States of America)
(73) Owners :
  • EXXON RESEARCH AND ENGINEERING COMPANY
(71) Applicants :
  • EXXON RESEARCH AND ENGINEERING COMPANY (United States of America)
(74) Agent: RICHES, MCKENZIE & HERBERT LLP
(74) Associate agent:
(45) Issued: 1987-12-01
(22) Filed Date: 1985-01-17
Availability of licence: N/A
Dedicated to the Public: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): No

(30) Application Priority Data: None

Abstracts

English Abstract


Abstract of the Disclosure
A speech recognition method and apparatus
employ a speech processing circuitry for repetitively
deriving from a speech input, at a frame repetition
rate, a plurality of acoustic parameters. The
acoustic parameters represent the speech input signal
for a frame time. A plurality of template matching
and cost processing circuitries are connected to a
system bus, along with the speech processing
circuitry, for determining, or identifying, the speech
units in the input speech, by comparing the acoustic
parameters with stored template patterns. The
apparatus can be expanded by adding more template
matching and cost processing circuitry to the bus
thereby increasing the speech recognition capacity of
the apparatus. The template matching and cost
processing circuitries provide distributed processing,
on demand, of the acoustic parameters for generating
through a dynamic programming technique the
recognition decision. The apparatus employs a method
for inhibiting a response to nonvocabulary utterances
to effectively provide improved noise immunity.


Claims

Note: Claims are shown in the official language in which they were submitted.


-46-
1. In a speech recognition apparatus
wherein speech units are each characterized by a
sequence of template patterns, and having
means for processing a speech input
signal for repetitively deriving therefrom, at a frame
repetition rate, a plurality of speech recognition
acoustic parameters, and
means responsive to said acoustic
parameters
for generating likelihood costs between
said acoustic parameters and said speech template
patterns, and
for processing said likelihood costs for
determining the speech units in said speech input
signal,
a method for inhibiting a response to
nonvocabulary utterances in a speech input comprising
the steps of
repeatedly, at a frame repetition rate,
generating acoustic parameters representing said
speech input,
generating likelihood costs at each
frame time for said acoustic parameters and said
template patterns, said template patterns including a
pattern representing silence,
beginning a normal speech recognition
process whenever said cost for an active template
pattern is better than a predetermined threshold
value, and
reverting to a non-speech recognition
process whenever said cost of said template patterns,
including silence, is worse than said predetermined
threshold value.
2. The method of claim 1 further comprising
the steps of
setting a second threshold value and
remaining in said non speech recognition
process until a likelihood cost of said silence
template in better than said second threshold.

-47-
3. The method of claim 2 further wherein there are two
silence template patterns, a first short silence pattern employed
during said reverting step, and a second long silence pattern
employed during said remaining step.
4. In a speech recognition apparatus wherein speech units
are each characterized by a sequence of template patterns, and
having
means for processing a speech input signal for
repetitively deriving therefrom, at a frame repetition rate, a
plurality of speech recognition acoustic parameters, and
means responsive to said acoustic parameters
for generating likelihood costs between said
acoustic parameters and said speech template patterns, and
for processing said likelihood costs for determining
the speech units in said speech input signal,
a method for inhibiting a response to nonvocabulary
utterances in a speech input comprising the steps of
repeatedly, at a frame repetition rate, generating
acoustic parameters representing said speech input,
generating likelihood costs at each frame time for
said acoustic parameters and said template patterns, said
template patterns including a pattern representing silence,
employing dynamic programming and a grammer graph
for determining in response to said likelihood costs whether
there has been a nonvocabulary utterance, said grammar graph
having a normal speech recognition branch and a non-speech
recognition branch, and
said employing step determining and selecting,
using said dynamic programming, the better scoring of said
speech recognition and said non-speech recognition branches.
5. The method of claim 4 further comprising the steps
of
assigning a fixed predetermined threshold score to
an entrance arc of said non-speech recognition branch of said
grammar,
assigning a second fixed predetermined threshold

Claim 5 continued... -48-
score to a self loop arc of said non-recognition branch for
providing a path to remain in said non-recognition branch, and
providing a silence arc leading from said self
loop arc to the beginning of said entrance arc.
6. The method of claim 5 further comprising the step of
assigning a self loop silence arc at a grammar
node from which said entrance arc originates.
7. The method of claim 6 further comprising the steps
of
providing said silence arc with a good score for a
long silence, and
providing said self loop silence arc with a good
score for a short silence.

Description

Note: Descriptions are shown in the official language in which they were submitted.


~2299~L
SPEECH RECOGNITION METHOD
HAVING NO IS IMMUNITY
;: '
ack~round of the Invention
:: The invention relates generally to the field
of speech recognition and in particular to the recog-
notion of speech elements in continuous speech
Jo The need for speech recognition equipment
which is reliable and reasonably priced lo well dock-
minted in: the technical literature:. Speech recognition
equipment generally falls in two main categories. One
category is speaker independent equipment wherein the
speech recognition apparatus is designed to recognize
elements of speech from any person However speaker
independent systems can be quite limited with regard
: to features other than the "speaker independence", for
example, the number of words irk the recognition vocal-
ulary. Also, typically, five to ten percent of the
Jo population will not be recognized my such systems. The
i: other category speaker dependent speech recognition,
relates to speech recognition apparatus which are tub-
.
staunchly trained to recognize speech element of a
:: ,
'

~2~39;~
--2--
limited class, and in particular the class consisting
of one person. Within each category, the speech fee-
ignition apparatus can be directed to the recognition
of either continuous speech, that is, speech wherein
the boundaries of the speech elements are not defined,
or to isolated speech, that is, speech in which the
boundaries of the speech elements are à priori known.
An important difference between continuous and is-
fated speech recognition is that in continuous speech
the equipment must make complex "decisions' regarding
the beginnings and ends of the speech elements being
received. For isolated speech, as noted above, the
incoming audio signal is isolated or bounded by either
a given protocol or nether external means which makes
the boundary decisions relatively simple.
There exist today many commercial systems
for recognizing speech. These systems operate in
either a speaker independent environment (as example-
fled for example by USE Patents 4,038,503; 4,227,176;
4,228,498; and 4,241,329 assigned to the assignee of
this invention) or in the speaker dependent environ-
mint. In addition, the commercially available equip-
mint variously operate in either an isolated speech or
a continuous speech environment.
The commercially available equipment, how-
ever, is expensive when high recognition performance
is required This it often a result of the best
equipment being built for the most difficult problem,
that is, speaker independent, continuous speech recog-
Nina. Consequently, many of the otherwise available
applications to which speech recognition equipment
could be adapted have not been considered because of
the price/performance relationship of the equipment.

~29~
Furthermore, the commercially available equipment can-
not easily be expanded to provide added capability at
a later date, and/or does not have the required awoke-
racy or speed when operating outside of the laboratory
environment
Primary objects of the present invention are
therefore an accurate, reliable, reasonably priced
continuous speech recognition method and apparatus
which can operate outside of the laboratory
environment and which enable the user to quickly and
easily establish an operating relationship therewith
Other objects of the invention are a method and
apparatus generally directed to speaker dependent,
continuous speech recognition, and which have a low
false alarm rate, high structural uniformity, easy
training to a speaker, and real time operation.
Summary of the Invention
The invention relates to a speech recognition
apparatus and method in which speech units, for exam-
~2:0 pie words, are characterized by a sequence of template patterns. The speech apparatus includes circuitry for
processing a speech input signal for repetitively de-
roving therefrom, at a frame repetition rate, a plus
: reality of speech recognition acoustic parameters. The
:25 acoustic parameters thus represent the speech input signal for a frame time. The apparatus further has
circuitry responsive to the acoustic parameters or
generating likelihood costs between the acoustic par-
emoters and the speech template patterns. The
circuitry is further adapted for processing the like-
Lydia costs for determining or recognizing the speech
units of the speech input signal.

I l
--4--
The invention features a method for
inhibiting a response by the apparatus to
non-vocabulary utterances in the speech input. The
method features generating likelihood costs at each
S frame time for the acoustic parameters and template
patterns, the template patterns including a pattern
representing silence If the cost for an active
template pattern or acoustic kernel is better Han a
predetermined threshold value the system begins a
I normal speech recognition process. If the cost of the
active template patterns, including the silence
template pattern, are all worse than the predetermined
threshold value the system reverts to a non speech
recognition process. If the silence template
corresponds to the only likelihood cost which is
better than the threshold value then the system
remains in a quiescent state.
Further in accordance with the method, a
second threshold value can be set and the apparatus
will remain in the non-speech recognition process
until a likelihood cost of the silence template is
better than the second threshold. In specific
embodiments, the two thresholds can be equal and
various silence patterns can be selected to represent
short and long silence.

~99~
--5--
Brief Description of the Drug
Figure 1 is a schematic block diagram showing
an overview of the speech recognitiorJ apparatus;
inure 2 is a schematic flow diagram of the
5: signal processing portion of the illustrated embody-
mint of the invention;
Figure 3 it a detailed slow diagram of the
dynamic programming and template snatching portion
according to the invention;
.10 Figure 4 is a more detailed schematic block
diagram of he uniform processing circuitry portion
according to the invention;
Figure 5 it a schematic representation of a
dynamic progra~Nning grammar graph according to the
invention;
Figure 6 is a schematic representation of a
dynamic programming word graph according to the
invention;
: Figure 7 it a schematic representation of a
grammar using a single element joke word according to
: the invention;
Figure B is a schematic representation of a
grammar using a double joker word according to the
inventions

--6--
Figure pa is a schematic representation of a
- grammar using a double joker word as a prowled to
speech recognition;
Figure 9 is a diagrammatic block diagram of
a prior art apparatus; and
Figure 10 is a diagrammatic lattice
presentation of the dynamic programming approach.
D i tion-of a Preferred Embodiment
esquire P
An Overview
Referring to Figure 1, a speech recognition
apparatus 10 according to the invention receive an
audio input over a line 12~ The audio input is buff
freed and filtered by a preamplifier 14. The prearm-
plifier 14 has an anti-alia~ing jilter and provides an
lo output voltage over a line 16 which ha the proper
voltage values (that is, is normalized) for an analog-
to-digital converter 18~ In the illustrated embody-
mint, the analog-to-digital converter operates at a
rate of 16,900, twelve bit conversions per second and
provides the twelve bit output on lines 20. The
twelve bit output of the analog-to-digital converter
is directed to a buffer memory 22. Tube buffer 22 has
a capacity for storing up to 320 samples or twenty
milliseconds of speech. Ire minimum requirement is
that buffer 22 be able to store slightly more than 160
samples.
The buffer I is connected to an internal
data bus 24. The bus 24 acts as the primary data come
monkeyshines bus far the apparatus. The bus 24 thus

-
~L2~9~
--7--
interconnects buffer 22 with a signal processing air-
quoter I a f first template matching and Dynamic pro -
tramming circuitry 28, a second template ~at~hinq and
dynamic programming circuitry 30, a process con-
trolling circuitry 32~, an acoustic parameter buffer
memory 34, and a trace back buffer 36~,
The recognition prowesses is controlled by the
process collateral circuitry 32 which incorporates, for
- example, a commercial microprocessor as the control
element. The process control circuitry uses the memory
34 to enable the apparatus to operate with a variate
processing rate. This means that the entire apparatus
need not meet peak load demands in real time but only
needs to operate in real time with respect to the
average processing load. the hardware savings involved
by employing this process configuration are substantial
and are discussed in greater detail below.
Each of the circuitries 26, 28, and 30, in
the illustrate embodiment, are structurally identical.
: 20 They aye modified by the Satyr programs placed
therein to per~crm either the audio signal processing
of circuitry 26 or the template matching and dynamic
programming of circuitries I and 30. This is
discussed in more detail below. Each of the circuit-
ryes 26~ 28, and 30 have therein a small 2,000 word
memory aye, aye, and aye respectively sixteen bit
works. These memories act as small fast" buffers
providing sufficient storage for continuous processing
in thy sorters 26, 28, and 30. Ike apparatus 10
rocketry can be easily expanded along bus 24 by adding
additional template matching and dynamic processing

\ I``` )
I
circuitries such as circuitries 28 or 30~ to process
a larger no ignition vocabulary. This expansion gape
ability is a direct result ox the chosen architecture
which dictates that template matching and dynamic pro-
sousing be executed on the same machine board and in
this particular embodiment by the use of identical
circuitries or circuitries 28 and 30.
In the illustrated embodiment r each of the
sequiturs 26, pa, and 30 occupy one complete board
of the assembly. While it would have been desirable
to combine two ox the data processing boards, the pry-
steal size of the combined board would not fit into
the rewaken, and hence in today's semiconductor tech-
neology, it way not feasible to combine the boards.
:15 The structure which has been developed, however, not
only allows new boards to be added along bus 24 as de-
scribed above, but reduces the data communications
"log jay" that typically occurs in prior apparatus
using separate template matching and dynamic pro-
ZOO tramming circuitries (see for example Figure g which
shows a signal processing circuit 46 feeding a them-
plate watching circuit 48 and the template matching
circuit 48 feeding a dynamic programming circuit 50).
As a result of using separate template matching and
dynamic programming circuitries, bandwidth problems
invariably occur at the intrinsically high bandwidth
connection 52 and must be solved. In the illustrated
embodiment of the invention the apparatus structure
Lowe for parallel processing on demand as described
below, thereby reducing the bandwidth retirement
along the bust correspondingly decreasing the cost of
the apparatus,

~2;~99;;~
g
The speech recognition apparatus and method
implemented by the illustrated embodiment processes
the audio input over line 12, in the signal processing
circuitry 26, to provide a sex of acoustic parameters
characterizing the input speech. It is these pane-
meters which are stored in memory 34. The set of
acoustic parameters can be thought of, in the thus-
treated embodiment, as a vector of sixteen eight-bit
numbers or components. A new vector of acoustic pane-
meters is generated each frame time, a frame time in
the illustrated embodiment being ten milliseconds.
The acoustic parameters are called for, on
demand, by the first and second template matching and
dynamic programming circuitries. These circuitries, in
:15 general operation, compare each vector with prescored
reference template patterns, only as needed, and gene-
rate likelihood statistics or C05tS representative of
the degree ox similarity. the previously stored rev-
erroneous template patterns characterize the speech eye-
~20 mints to be recognized using the same processing
methods employed in generating the acoustic pane-
meters. Each speech elements, for example a word, is
characterized by a sequence of template patterns as
described in more detail below. A dynamic programming
method is e~ployea to generate a recognition decision,
based upon the likelihood statistics or costs, in a
manner which allow the apparatus to operate in real
iamb
The description below first descries how
the acoustic parameters are generated. It then details
the processing of the acoustic parameters, by air-

`
~2992~L i
--10--
quoters 28 and 30, to obtain a recognition decision.
It is important to note and appreciate that circuit-
ryes 2Q and 30 are more than identical in structure;
they operate in parallel in order to distribute the
processing load and further enable real time operation.
Audio Signal Processing
Referring now to figure 2, an acoustical
input 100 over line 12 of Fly. 1) is passed at 101 to
an A/D converter (18 in Fig. 1) after the necessary
normalization and buffering (sown in Figure 1 by pro-
amplifier 14). The output of the A/D converter, after
buffering by memory 22 (Fig. 1) is processed by the
signal processing circuitry 26 as follows.
In the signal processing description that
follows, the processed value of the data will often
be clipped or normalized so that they fit into one
eight bit byte. This is done so that a subsequent
multiplication and/or accumulation does not produce
any overflow beyond sixteen bits, and with respect to
normalization to make best use of the available dynamo
to range.
The twelve bit output of the A/D converter
is differentiated and clipped at 1020 the input is
differentiated by taking toe negative first differences
between ~accessive input values. This occurs at the
16 I sampling rate, The differencing procedure no-
dupes the dynamic range of the input waveform and pro-
emphasizes the high frequency. In the frequency
domain, the effect of differentiation is multiplica-
lion by frequency which results in a six dub per octave

~99~
"boost for the high frequencies. This high frequency
reemphasis is desirable because the amplitude of the
speech signal decreases as a function of frequency.
The differentiated acoustical signal is then clipped
so that it fits into one byte. I
I
The average amplitude, the mean, and the logo the mean squared amplitude ox the differentiated and
clipped output is then determined at 10~, 105 for, in
the illustrated embodiment, a "window" having 3~0
I sample or twenty milliseconds ox speech. The log
used here is:
8 log (amplitude) - 128 eke. 1)
The result is then clipped to fit into a single byte.
While the window" used here is twenty mill
liseconds in duration, it is important to recognizethatO according to the invention, the signal processing
circuitry is intended to generate a new set of acoustic
parameters (as described below) every ten milliseconds.
Successive windows therefore overlap by ten Millie
second it the illustrated embodiment
exit, the differentiated and clipped data oft twenty millisecond window is normalized at 106 by
subtracting therefrom the mean amplitude over the
Wendy. This it in effect equivalent to subtracting
:25 thy zero frequency component, or I level, of the sign
net. The normalized data is clipped again to fit
within a single byte.
-

-12-
The normalized output from block 106 is then
"windowed" at aye. Winnowing is multiplication of the
input array by a vector of window constants at 109
which has the of foot of attenuating the data at both
:5 ends of the window. In the frequency Dylan, the
height of the side lobes is thereby seduced at the
cost of increasing the width of the center lobe thus
smoothing the resulting spectral estimate. While
various types of windowing functions exist, each
producing slightly different tradeoffs between side
lobe height and center lobe width, the window chosen
it the illustrated embodiment and which has been found
to produce statistically better results is
swanker window which consists of multiplying the
:15 sine function (sin Ajax by a Raiser function.
The Kaiser window is useful since it pane-
motorizes the trade-off between side love height and
center lobe width. Multiplying by the sine function
gives a bandwidth at each frequency in the Fourier
transform. In the illustrated embodiment, a bandwidth
of 355 hertz is used In the Kaiser function window,
the beta parameter, I, is set at I

I
--13--
The raiser function, which is described for
example in digital Filters, chapter 7 of System
nal~sis by Dig Italy Computer, Duo arid ashore, John
Wiley & Sons, New York, 1966, is given by
eke. 2)
It (13 (n/ (N-l) Jo 2 (rut/ ( No I ) ) 1/2 )
X (n)
It (B )
where It is a motif ted zeroth order Bessel function
of the first kind:
or eke. pa)
It I = Jo I, gosh (Z c08 do .
The 5 i no f unyoke t ion f r the pa r emote r s of the
illustrated embodiment is: .
(En. 3)
a I- ~N-1)/2) (n = 0,1,~. . No and
N - 320 points in the
window
where a = 2 If ( 355J2) - . û697
The waveform is windowed a ton normalization
(rather than before normalization since otherwise the
mean would introduce an extra rectangular signal which
would increase the side lobes. The windowed waveform
is block normalized so that its samples fit in thin-
teen bits. This I so that accumulation performed
during folding as described below, does Seiko overflow
si:Kt~n bits.
:
The discrete Fourier transform of the win-
dozed date i now performed at 11~ . While there are
many ways off efficiently performing a Fourier trays-
form, in order to reduce the number ox multiplications
an hence the tome to effect the Fourier transform

( )
-14-
computation, a folding technique a 113 making use of
the symmetries of the sine and cosine, by converting
the data vector into four smaller vectors, is
performed. Since the sampling of values in the
frequency domain is done at multiples of a base
frequency, each of the resulting vectors will hove a
length of 1/4 the period of the base or fundamental
frequency, also called the base period.
In the illustrated embodiment, the base ore-
unwise from a 16 kilohertz sampling rate and a 20 mill
lisecond window is chosen to be 125 hertz. (The eon-
responding base period is 128 samples.) This repro-
sent the minimum chosen spacing between spectral
frequency samples. As noted above, a 20 millisecond
window encompasses two 10 millisecond frames of the
incoming acoustic Ann digitized) signal. Successive
windows" thus have an overlapping character and each
sample contributes in two windows.
According to tube folding technique, frequent
ales are divided into odd multiples and even multiples
of the base frequency The real (multiplication by
the cosine) and the imaginary (multiplication by the
sine) components of the transform will each use one of
the resulting folded vectors for both classes ox
frequency.
The folding operation is performed in three
steps. First, the element which are of fret by the
base period are grouped together. In the illustrated
embodiment, using a base frequency of 125 hertz at a
sampling rate of 16 kilohertz, this results in 128

)
I., .
go .
--15--
points. This "fold" is a direct consequerlce of the
expression for the Fourier transform
F(nf) - I eye jnkf) (En. 4)
where f is the base frequency ~125 ho awns the sum is
extended from k = O through the number of the samples
irk the window less 1). Since
eye) = e joy + 27r ), (Erg, 5
the transformation can be rewritten as:
Enough) k Al et-2Tr jnfk) (En. 6)
:10 where the sum is extended from k = O through k = 45j-1
and is equal to 1/4 ox the base period and Al is
the sum of I x(k+4q) + x~k+8q)
the second fold operation is performed by
rewriting the vast expression (Equation 6) as:
lo Pine) k ilk lcosl2~r nfk~ j Sweeney nfk~).
ode earl then use the symmetries ox the sine and cosine
functions since sin (a) = -sin (try a) and coy (a) -
Shea try a), the tranfQrm of Equation 7 can be rewritten
as:
::~20 F(nf~=~x2c kiwi nfk) + issue Sonora nil (En. 8)
where
x2c Al (k) Al (cluck) and
x2~ = Al I - Al I k).
Thy sum extends Rome k = 0 to k = I Thus there
~25: are 64 terms where the sampling rate is 16 kilohertz
and the base frequency it 125 Horace.

~:2~9:~
-16-
The third stew of the procedure resolves the
instance of odd multiples of the base frequency. The
symmetries sin = Sweeney) and Casey = -Casey)
allow the transformation of Equation 8 to be rewritten
as:
F(nf~ = kx3Co Casey nfk) + j X3so Sweeney nfk)
(En. 9)
where
X3co = X2 coy - X2 Cossack - lo and
X3so = I sink) x2 Sweeney - l-k).
For even multiples of the base frequency the equations
are: .
F~nf) = kX3CE Casey nfk) + j x3SE Sweeney nfk)
let. 10)
where
x3cE = x2 coy + I Casey - lo and
x3sE = x2 sink) - x2 Sweeney k).
This procedure uses the equalities of sin =
-Sweeney) and coy - Casey). The sums, after the
third procedure or third fold, now extend from Nero to
k Q-l. That is 32 terms for a sampling rate of 16
kilohertz and a base frequency of 125 hertz. After
the three folds, the vectors are block normalized to
six bits.
At this point the discrete Fourier transform
is completed by multiplying the date from the folding
procedure by a matrix of sines and cosines (aye). my
calculating the multiple of the base frequency,

- )
-17-
module 2, the set of folds which are nodded to be used
can be determined. The resulting vectors are block
normalized to fit within a single byte. The result of
the Fourier analysis is two vectors of signed integers,
the integers ranging from -128 to 127 that is, one
byte), one of the vectors containing the real term of
thy Fourier analysis and the other containing the
imaginary terms of the analysis. The length of the
vectors is equal to the number of frequencies employed
I during the recognition process. In the illustrated
embodiment the number of frequencies is thirty-one and
are:
~50 1250 2250 3250
375 1375 2375 3500
SO loo 25~ 3750
625 1625 2625 4000
7~0 1750 Z750 SEIKO
87S 1~75 2875 5000
1000 20~0 ~000 5500
20 1125 2125 31~S
The next step, at 114, is to determine the
sum of the square of the real and imaginary parts of
the spectrum at each multiple of the vase frequency.
The result is divided by two, that is, shifted down
Z5 one bit so that the square root which is computed at
116 will fit into one byte.
After thy square root is taken at 116, the
no ulting spectrum it transformed, at 118, according
to the equation:
l - 128(x - monks Jean). I 11)
.. . . .. ..... . ..

- 18 -
1 Tunis function, often called the "quasi-log" and described
in more detail in US. Patent No. 4,038,50$, enhances the
speech recognition properties of the data by
redistributing the dynamic range of the signal. The
"meant' is the average value of the array of spectral
values.
The result of the quasi log is a vector having,
in the illustrated embodiment, thirty-one acoustical
parameters. Added to this vector r at 120, is the mean
amplitude value of the input signal calculated at 104 and
105. This amplitude value and the 31 element vector
resulting from the quasi-log step at 118 are applied to a
principal component transformation at. 1220 The principal
component transformation reduces the number of elements in
the vector array to provide a smaller number of acoustical
parameters to compare against the speech representing
reference template patterns prestGred in the apparatus.
This reduces both computational cost and memory cost.
The principal component transformation, at 122,
is performed in three part. In the first step a speaker
independent mean for each element of the vector is
subtracted from the respective element in the vector.
Second, the result of the first step is multiplied by a
speaker independent matrix which is formed as described
below. Finally, each element of the resulting vector
from the second step is individually shifted by an
element dependent amount and then is clipped to fit
within an eight bit byte (see 123). This essentially
means that the distribution of each component of
the vector has been normalized so that

~22~
--19 -
the byte will contain three standard deviations around
the mean of the component. The output of the
principal component analysis, an array ox bytes having
a vector length equal to a selected number of
components, is the acoustic parameter output of the
signal processing section 26 of the apparatus of
Figure 1 and is applied to the bus 24 for further
processing as described In more detail below
The principal component analysis is employed
to reduce the number of acoustic parameters used in
the pattern recognition which follows. Principal come
potent analysis it a common element of many pattern
recognition systems and is described in terms of an
eigenvector analysis for example in ITS Patent No.
4,227,177p assigned to the assignee of this inn
Zion. The concept of the analysis is to generate a
et of principal components which are a normalized
linear combination of the original parameters and
where the zecombination is effected to provide come
pennants with maximum information while maintaining
independence from each other. Thus, typically, the
first principal component is the normalized linear
combination of parameter with the highest variance
The concept is that the direction containing the most
variation also contains the most information as to
which class of speech vector belongs to
Normally, the vectors containing the linear
combinations under consideration are constrained to be
of unit length. This would be the best procedure if
all recombined parameters had the same variation with-
in the defined speech classes, but unfortunately they
do not. rho prior art anal is method is therefore

i .: )
Shea
-20~
modified to normalize all linear combinations of pane-
meters to have the same average, within class, van-
ante and the first principal component is chosen to be
that normalized linear combination with the highest
across class variance.
the principal component analysis is derived
as follow. Let M' represent the trounces of a matrix
M. Let T be the caverns matrix of some set of P
original parameters. Let W be the average within class
caverns matrix. Let V be a P dimensional vector of
coefficients, and let X be a P dimensional vector of
acoustic parameters. For simplicity ox derivation,
assume that the original parameter vector X has a zero
mean. In practice, in the illustrated embodiment the
assumption is implemented by subtracting from the oft-
Gina parameters, their respective mean values The
: variance of a linear combination of parameters VEX
is
I 12)
Expectation of [~V'X)~V'X)'] - V'(XX')V = TV
I since I is defined as the expectation of XX'. Sue-
laxly, if Wit) is the caverns of the ilk class of
parameters, then Viva is the caverns of the
parameter VEX in the ilk class. The average within
class variance is then
I Viva eke. 13)
i31
where N is the number of classes. By the distributive
property of matrix multiplication, this is just equal
to VIVA. this enchain weights ail classes equally,

independent of the number of samples contained in each
class. Actually, in the illustrated embodiment, all
samples were weighted equally on the assumption that
it is more important to discriminate between classes
that occur frequently.
Next TV is maximized under tube
constraint: VIVA = 1 (En. 14
This problem can be solved by using Lagrange multi-
plower Let the one Lagrange multiplier be denoted by
10 y. We then want to solve eke. 14) and (En. 17~ (below)
f TV - y TV - 1) let. 15)
O - df/dV 5 TV - 2yWV (En. 16)
TV - yWV eke. 17)
.
This equation 117) is just the general eigenvector
problem and is solved by P eigenvectors with real
eigenva}ues, due to the symmetry of T and W.
Thy solution which maximizes the original
criteria is given by the ei~envector V with the largest
eige~value y. This can be seen by observing that the
:20 variance of VEX it given by
TV = Viva - yV'WV - y 1 = Y (En. 18)
Let the solution to (En. 173 wit the largest
ei~envalue be Ye, and the corresponding eigenvalue
be Ye- or the next parameter, solve for the linear
combination of parameters VEX which maximizes lV'TV),
which has unity within class variance, lV'WV - 1), and
which is uncorrelated with Vl7X. The condition that

~2;2~32~
-2Z-
VEX is uncorrelated with Vex allows the analyst s to
find the linear combination of parameters which has
maximum across-c7ass variance compared to Within class
variance, but which does not include information
already available from the first component. By the
definition of uncorrelated, we have
(En. 19)
Expectation of lox (Vl'X)'l = V'(XX')Vl = V'TVl = 0
Let y and z be Lagrange multipliers, and solve (Equal-
lions 14, 19 and 21):
g = TV ytVi~v z(2V'TVl) (En. Jo)
O Y dg/dV TV - 2yWV - 2zTVl (En. 21)
multiplying through on the left by Al and dividing
by 2:
Al TV yVl TV - zVl~lVl (En. 22)
Substituting TV z 0 as a constraint; and Wylie =
T~l/yl from eke. lay in these relations; and mull
toppling through by -1:
twill) Al TRY) zylVl~WVl (En. 23)
0 = Swahili) Zulu o (En. 24)
z - 0 (En. 25)
Therefore, given Eli toe following three equations
can be solved:
TV yWV eke. 26)
VIVA - 1 (En. 27)
V'*Vl (En. 28)
Any vector which satisfies (En. 26) can be turned
ionic a vector which satisfy its botch (En. 26 and 27) by
dividing Q by the sealer Q 'WE.
Consider any two vectors, Q and R, satisfying
I 20, with corresponding eigenvalves q and r. when

~;~2992~
-23-
rQ'WR = TRY - RUT = qR'~Q - qguwR (En. 29~
trucker - ox eke. 30)
If r is not equal to then
0 R = truer eke. 31)
TRY = I eke. 323
It (En. 26) is satisfied by two vectors with different
nonzeros eige~values, then those two vectors Allah sat-
iffy I 28), and are therefore uncorrelated.
The second companion is thus chosen to be the
vector V2 which satisfies (En. 26 and 27) and which
has the second largest eigenvalue, assuming the two
largest are distinct. Correspondingly, the nth come
potent Van can be made uncorrelated with Ye through
Vn_l and the same procedure can be employed to show
that Van must only satisfy (En. 26 and 27), so long as
there are n distinct nonzeros eigenvalues.
In this manner, assuming that the character-
icky polynomial corresponding to let. I has N disk
tint nonzeros roots, a sequence of N linear combine-
lions of parameters, each one maximizing the ratio
VET VIVA), while constrained to be unco~related
with the previous linear combinations in the sequence,
con be determined. This sequence comprises the N gent
realized principal components.
In the method described above the linear
combinations of parameters that maximize TV while
holding VIVA constant are found. Since, as can be
easily shown, T - W B, By is the "average between-
ala s variance, the same results can be obtained by
maximizing V'BV while holding VIVA constant Since B
can be expressed as the em of terms involving dip-

~L~2g92~
-24-
furnaces between pattern means, intui~ive].y, by maxim
mixing V'BV, the pattern means are being moved apart
from each other. It may be that in speech recognition
it is more important to distinguish between some pat-
terns than between others, If the differences between
pattern jeans in the formulation of B are weighted by
different amounts, ho resultant principal components
will be biased to differentiate more between some pat-
terns than others. One instance for assigning dip-
fervent weights to different pattern pairs is when data
and patterns from different speakers are used in the
principal components analysis. In such a case it is
not worthwhile to try to separate patterns from dip-
fervent speakers, and all such pattern pairs should
.15 receive the weight of zero.
The creation ox the principle component ma-
trip takes place prior to the real time speech recog-
notion process. Typically, a large data base is
: require to obtain a good estimate of the matrix.
:20 The output of the principal component trays-
formation is the acoustic parameter vector. A new
"vector" or set of acoustic parameters is made avail-
able each frame time, that is each ten milliseconds
in the illustrated embodiment. The acoustic parade-
lens are available on bus 24 from the signal process-
in circuitry 26. Since each frame time has a window
repro ending twenty milliseconds of input audio, there
is as noted above, an overlap in the information rep-
resented by the acoustic parameter data the acoustic
parameter data is stored on the memory buffer OWE As
noted above a and under the control of the process con-
troller 32~ this allows a variable rate data process-

99~L
-25
in by circuitries 28 and I This it important in
order to allow the entire apparatus to meet the aver-
age real time requirements of analyzing the incoming
speech, but not to require it to maintain real time
processing throughout each speech element on frame
by frame basis.
On a frame by frame basis, the maximum them-
plate matching and dynamic programming data processing
requirements generally occur toward the middle of a
speech unit, for example a word. Correspondingly, the
processing requirements at the beginning or end of thy
speech unit are generally substantially less, in fact,
less than the capability of the described apparatus.
Thus, the use of buffer 34 to store acoustic data in
lo combination with the capabilities of the circuitries
28 and 30~ allow the average processing rate of the
apparatus to be greater than that required for real
time operation. Thus real time operation is achieved
: : without requiring the hardware to meet instantaneous
peak processing rates.
Likelihood Computations
he purpose ox the Lockwood computation is
to obtain a measure ox the similarity between the in-
put acoustic parameters from the principal component
transformation and the template patterns which are
employed for describing the elements of speech to be
recognized. In the illustrated embodiment, each
speech element it described by a sequence of template
patterns. Associated with each template pattern is a
minimum end a maximum duration the combination of the
template pattern and the minimum and maximum durations
is an acoustic kernel. Typically the speech element

Lo I
--26--
is a spoken word although the speech element can also
be a combination of worms, a single phoneJne, or any
o then speech us i t
I n accordance with the present invasion
thy chosen measure of comparison is a monotonic lung-
lion of the probability of a particular speech them-
plate given the observation of acoustic parameters at
a given frame time. queue acoustic parameters can be
thought of a the observation of a random variable.
ho random variables which model the acoustical pane-
meters have a probability distribution given by the
pure unit of sound intended by the speaker. Since the
sound which is no evade is not "pure" and for ease of
computation, the probability distribution which is
chosen is the Lapels, or double exponential, duster-
button. The Lapel distribution is written as, for a
single random variable,
f lo) - Ce-lx-u~b (En. 33~
where u is She mean of the distribution, b is inversely
proportional to the standard deviation, and c is such
thwack the density integrates to one In order Jo make
the computation easier, the logarithm a the likely-
hood rather than thy likelihood itself is chosen as
the measure to be employer. (This allows addition
rather than multiplication to be employer} in cowlick-
feting probabilities. 3 Lucy can be accomplished since
the logarithm is a monotonic function of its argue
mint. Therefore the measure can be rewritten as:
in l - ha - Ix-ulbo (En. 34)
In this computation only u and b need to be known since
the natural log of c: is determined by the condition

~L22992~3L
-27
that the density must integrate Jo one. In order to
keep most numbers positive, the opus or negative
of this measure is employed. The acoustic parameters
for a given frame are assumed to be independent so
that the f final expression or the measure of
likelihood probability becomes
Cost R six - us I by Eke. 35)
where the sum extends over all acoustic parameters,
and R is a f unction of the b ( i) .
In the illustrated embodiment, the likelihood
computation is generated for a temple e pattern "on
demand for use by the dynamic programming portion of
the apparatus. Where, as here, there are two circuit-
ryes 2B and 30 operating in parallel, it is possible
that the dynamic programming portion of circuitry 29
or I requires a likelihood score from the other Syria
quoter 30 or I this would require the transmission
of data over bus 24. The dynamic programming "dive-
soon ox labor according to the grammar revel and word
level griefs described below is chosen to minimize
this requirement for thy transmission of data on bus
I .
The input to the likelihood calculation, as
noted above, consists of the acoustic parameters at a
frame time and the statistics (u and b above describe
in the template pattern. The template pattern stay
tistics convict of a mean (us) and a weight" (by)
for each acoustic parameter, and a logarithmic term
corresponding to I In the template pattern eta
`30 fistic creation, the logarithmic tern usually has
been shifted to the right Idivlded by a power of 2) by
a quantity which is selected to keep the magnitude of
the cost within a sixteen bit integer. For each

1 acoustic parameter the absolute value of the difference
between the acoustic parameter and the mean is determined
and that quantity is multiplied by the weight associated
with the parameters These quantities are added for all
of the acoustic patterns; and the sum, if it is less than
the maximum sixteen-bit quantity, is shifted to the right
by the same quantity applied to the logarithmic term so
that the logarithmic term can then be added thereto The
result is the "likelihood" or "cost" for the employ
pattern at that frame time.
The Dynamic Programming Approach
A dynamic programming approach to speech
recognition has been used and described elsewhere, for
example, in the applicant's Canadian patent numbers
1,182,222, 1,182,223 and 1,182,224 all of which issued
February 5, 1985. The dynamic programming approach used
herein is an improvement upon the dynamic programming
approach described in these Canadian patents.
Referring to Figure 10, the dynamic programming
approach can be thought of as finding the best path 150
(that is the path with the minimum score) through a
matrix 152 in which the rows are indexed by time
indiscrete intervals (corresponding to the discrete
time intervals for which measurements are made) and the
columns represent elementary units of speech (acoustic
kernels in the illustrated embodiment). In theory
it is possible to try all possible paths through the
matrix and choose the best one. However, there are
far too many paths to consider each time and hence
in order to find a computational

I I
-29-
efficient method and apparatus for finding the best
path through the matrix, a Markov model for the speech
is considered. A stochastic process is said to be
Markovian, if ye probability of choosing any given
state at a time to depends only upon the state of the
system at time t, an not upon the way in which that
state, a time t, was reached.
In speech, there is coarticulation, the
state by which a given elementary unit of speech
affects both those unit which are spoken Before it
and after it. units of speech have an effect upon
the past because a speaker anticipates what he is
going to say.) In order to work around the coarticu-
lotion problem within a word, the template patterns
are formed for the coarticulated speech units. Lucy
method mazes it very difficult to share template be-
tweet words which theoretically have the same speech
units and is why, in the illustrated embodiment of the
invention, the apparatus does not attempt to share
such template. For the purposes of the illustrated
embodiment, coarticulation between words it ignored.
Thus, the Markovian model for speech is built
by including within each state all the information
relevant to future decisions. Thus the units of speech
are grouped into word because ultimately it is words
which will be recognized and it it at thyroid level
that syntactl~al constraints cay and, in the thus-
treated embodiment, must be applied. syntactical con-
straits are represented by a grammar graph 158 (Fig.
5) and it is the grammar graph that makes the model
Markovian. Therefore when recognizing utterances,
the state space through which the path must be found

~%2~39;21
-30-
viewed as logically existing at two levels the gram-
mar or syntax level, and the Ford level at which the
elementary speech units exist.
At the grammar level the state space consists
of a number of connected nodes. A node is a logical
point in time which lies either between, before, or
after individual words within an utterance. At each
node there is a fixed legal vocabulary, each word (or
words) of which connects tube node to a new node. A
grammar graph thus consists of a list of arcs, an arc
having a starting node, an ending node, and a list of
words which cause the transition there between tree
Fig. 5). Thor "self" arcs, the starting and ending
nodes art the same.)
~15 The second level mentioned above employs
word models. A word model is a finite state repro-
sensation ox a particular word as spoken by a portico-
far speaker. In accordance with the illustrated
embodiment, the word model employed is a linear so-
quince of acoustic "kernels. A noted above an
acoustic kernel is a jingle acoustic template pattern
having a minimum and a maximum duration. In the if-
lust rated embodiment, a word thus consists of a so-
quince of sounds (each represented by a template pat-
tern), with a minimum and maximum duration of time
being associated with each sound. There is no prove-
soon for alternate pronunciations and in accordance
with the preferred embodiment of the invention, the
method is implemented for speaker dependent speech
recognition. Thus, the method relies upon the best
estimate that the same speaker says the same word in
roughly the same way at all times.

i
-
~2919Z~
-31-
In graph farm, referring the Fig. 6, each
word model acoustic kernel has a minimum duration of
on" samples and is represented by n identical nodes
160. These are different from the grammar nodes
mentioned above The Noah nodes are strung together in
a series, each node having a single age coming into
and a single arc leaving it. The maximum duration,
that is a duration greater than the minimum duration,
can be represent Ed by a last node having an arc coming
in, an arc leaving, and a Self lop which is the
optional dwell time, that us, the difference between
the minLnum and maximum durations. All of the arcs
have the same acoustic template pattern associated
therewith, and for the self loop a count of the
number ox times through the loop must be kept in order
to accurately maintain all of the information which
is needed later during trace back).
the word model graph and the grammar model
graph are integrated by replacing, in the grammar
graph, each arc with the corresponding word models.
The connection between the grammar nodes and the word
; nod is made by what are called null arcs. pull
arcs also allow the apparatus to skip over optional
words in the grammar, for example arc 162 twig. 5)0

~22~32~
Once the needed likelihood computations are
availably*, the method proceeds to recognize speech
using those likelihood statistics and the allowable
syntax graph of the incoming speech. Pictorially
then, the graph of the utterance us f first transformed
into a lattice, within a matrix, as follows. (see,
e.g., Fig. 10) Each stave or node of the graph eon-
responds to a column of the lattice and each row of
thy lattice corresponds to a specific frame time.
Thus, a lattice state in row I corresponds to time I
and a lattice state in row 3 corresponds to a time J.
Thus, traversing the lattice between row I and row J
corresponds to obtaining the acoustic parameters for
the times between and including times Ill and J while
traversing, in the gryphon, the archways whose origin
node corresponds to the original column and whose end-
in node corresponds to the destination column.
: Imposing minimum and maximum durations upon the them-
plate patterns corresponds to restricting the vertical
:20 distance that the lattice arcs can span between two
columns.
The main thrust of the dynamic programming
employed in the present invention is, at each row (or
time) in the lattice, find the optimal path to a desk
Tunisian lattice state using the previously computed
costs of the states in the rows between the destine-
lion row and the starting row. The "optimal or best
path is equivalent to minimizing the cumulative like-
Lowe score by choosing the template pattern which
maximizes the conditional probability that the speech
unit corresponding Jo the template is the correct one
given the acoustic parameters at the frame time. That
conditional probability is maximized over all of the

go
-33-
"active templates active templates are defined
below
More specifically, the dynamic prorating
performs the following step at each Roy time:
I All nodes are initially set to an initial
maximum (16 bit) likelihood score
- 2) Destination nodes of null arcs can
inherit their score from the source node
ox the arc in zero time.
.10 3) Each "active" kernel in a word on each
grammar arc is processed using likelihood
computations and duration information
and a minimum score for the word at that
frame time is determined.
4) If the minimum score for the word is
greater than some predetermined thresh-
old, the word is deactivated to reduce
computations with successive frames.
This is effectively a method for reduce
`20 in computation based on the expectation
that this path will not be the optimum
one.
So The cumulative likelihood score of paths
: . at grammar odes, that is, at the end of
:25 words leading to a grammar node, is coy-
putted.
6) I not all of the kernels of a word are
active and toe score of the last active
kernel is less than some preselected
activation threshold, 'eke next kernel of
the word is made alive

`
32~
-34- .
Al If the score at the final grammar node
of the graph node 200 of Fugue 5, is
better (i.e. less) than the score at any
. intermediate grammar nuder then an end
of utterance has been detected
Considered in more detail, at the acoustic
kernel level, the dynamic programming uses the "seed
score for thy source node of the kernel, the cost of -
theta rustic kernel calculated from the current frame,
It and the global minimum score of the last frame, to
arrive at the likelihood score of a particular kernel
at a present frame time. As noted above, the portico-
far hardware embodiment described herein determines
the likelihood costs "on diamond Thus, as the like-
Lydia costs are required by the kernel level dynamic
:: programming for a particular frame time, the likely-
hood computation is generated. teach node correspond-
King to the kernel (recall, referring to Fig. 6, that
Jo the kernel it modeled by a plurality of nodes, one for
eschew required frame time duration can inherit as the
; "seed score a likelihood score prom a preceding
node. (For the first node of a kernel, the seed
screen is inherited prom thy last node ox a preceding
I: kern unless the first kernel node is the first node
along a grammar arc in which case, the "seed core" is
inherited from the listened of the best score leading`
into the grammar node.) In addition, thy last node
having the kernel can inherit a score from itself
(because of the use ox tube self loop in thy word
3Q model in which case the number of times that the self
loop has been traversed must be recorded. In order to
keep the accumulated costs as small as possible, all
of the likelihood scores are normalized by subtracting

~22~
- 35 -
1 the global minimum score (that is, the host score) of the
last frame. The new score is then the sum of the
inherited score plus the likelihood score or cost for that
template at that frame time. When all of the "active"
kernels have been processed, the minimum score for the
word is determined and output to the corresponding grammar
node.
f the minimum score for the word is greater
than a selected deactivation threshold r all kernels a the
word are made inactive except for the firs one. This has
the effect of reducing the required likelihood and dynamic
programming computations at the risk of possibly
discarding what might become an optimum path. On the
other hand, if the score on the last node of the last
active kernel of a word (where not all kernels are active)
is less than no activation threshold, then the next kernel
of the word is made active. If all of the kernels of the
word are active, the destination node of the current
grammar arc receives a score which is the minimum score of
all the words on all the arcs leading into the grammar
- destination node.
The dynamic programming employed here is similar
to that described in Canadian patent number 1,182,224
noted above. The primary difference between the dynamic
programming employed here and that described in the
earlier patent:, is the use here of the null arc and the
activation/de~lctivation threshold. The use of the null
arc in this Grammar advantageously provides the ability
to concatenate words which enables an easier
implementation of the apparatus according to thy
invention. And, as noted above, the activation/

!
.
I
-36~
deactivation thresholds reduce the computational
demands on the apparatus.
In the preferred embodiment of the invention,
a partial trace back is employed as a memory saving
device. At each frame time, all nodes a a time in
the past, equal to the present time Linus the maximum
duration of a word, are checked to see whether there
it an arc leading from that node to any node at a
water time. If there is not, these nodes are elm-
noted from the set of nodes which can be used in
trace back. Recursively, all nodes in the further past
that have arcs leading only to the deleted nodes are
in turn deleted. There results therefore a pruned set
; : of tr~ceback nodes which enable less memory to be em-
:15 plowed for the trace back process.
Once end-of-utterance has been detected, for
: example, when the final grammar node has a better
Lowry score than any other grimmer node of the Papa-
: fetus, a forced trace back method is employed to de-
termite, based upon the result of the dynamic pro-
tramming over the length of the utterance, what was
actually the best word sequence. The tusk starts
at the final node of the grammar graph and progresses
backwards, toward the beginning of the utterance,
through the best path. Ike output of the traceb~ck is
the recognized utterance along with the start and end
times, and if desired, a score for each word. Thus,
the output of thy feast cost path search is the most
probable utterance which is consistent with the spew
Swede grammar, the specified word models and Tom
plates, and the input acoustic parameters. Further,

I
.
-37- .
information necessary to train on the utterance is
also available from the system a described below.
TrainingJEnrollment
The description presented thus far describes
a method for recognizing speech using a plurality of
previously formed template pattern. The formation of
those template patterns is key to providing a speech
recognition system which is both effective and felt-
able Consequently, care and attention must be given
to the creation of the template patterns. In portico-
lark for the illustrated embodiment of the invention,
the apparatus is designed to be speaker dependent and
hence the template patterns are specifically biased
toward the speaker whose speech will be recognized.
Two different methods are described herein-
after for adapting the apparatus to a specific speak-
or. In a first enrollment method a zero-based
enrollment, an initial set of template patterns eon-
responding to a new word is generated solely from an
input set of acoustic paramours the set of template
patterns is created by linearly segmenting thy income
in acoustic parameters and deriving the template pat-
terns therefrom. ho second method, a training pro-
seedier, makes use of a set of acoustic parameters
derived from the spookier and the recognition results
(that is, speech statistics) of known or assumed
utterance which provide an initial trough cut" for
the template patterns, to create better templates.
This is accomplished by performing a recognition on
each word within a known utterance and using a known
worn model.

Lug
I
Turning f first to the zero-based enrollment
technique, a number of acoustic kernels, each with a
minimum and maximum duration, are set for the word
based on the duration of the word. The beginning and
end of the word are determined, as will be described
below and the frames ox acoustic parameters are then
proportionally assigned to each kernel (five frames
per kernel). The template petrol statistics, the
meat and variance, can then be calculated from the
acoustic parameters. In the illustrated embodiment,
zero-based enrollment employs, for example, ten awakes-
tic kernels for a word (an average of So frames in
Doreen, each having a minimum duration of two
frames and a maximum duration of twelve frames. There
results a set of statistics which can be employed for
desc~iking the utterance or which can be improved, or
example as described below
For the training method, the input data
include not only the acoustic parameters for the spot
ken input word, but training data from a previous
least cost path search. This data can include the
tentative beginning and ending times for a ward. If
no thresholding is performed, and there are no dwell
time constraints, the dynamic programming should give
the same result as the grammar level dynamic program-
Mooney If the dynamic programming at the word level
ended where expected, and trace back was performed
within the word at the acoustic kernel level. tube
trace hack information helps create better template
patterns for the word. As a result, a better set of
templates can ye achieved. In the illustrated embody-
mint, a special grammar consisting of the word alone
it built. All of the kernels in the word are made
active and word level dynamic programming is performed,

-) :
`: .
I
-39-
For each word used in the training process,
one of the most important aspects of effective training
is properly setting the starting time and Lowe ending
time for the word. Several procedures have been
employed. Once procedure employs a threshold value
based-upon the amplitudes of the incoming audio sign
Noel Thus for example a system Jan be designed to
listen to one pass of the speech and to take the
average of the five minimum sample values the average
minimum) and the average of the five maximum sample
values (the average maximum) during that pass. Then,
a threshold is set equal to the sum of four times the
average minimum value plus the average maximum value,
the entire sum being divided by five. The system then
goes through the speech utterance again and after
several (for example seven) frame times have
amplitudes which exceed the threshold value, the word
is declared to haze begun at the first of the frames
which exceeded thy threshold value. Similarly, at
~20 the end ox a word, the word is declared to have ended
at the end of the last of several lion example seven)
frame each of which exceeds the threshold value .
.
A preferred approach however to determine the
heginninq and end of an utterance during enrollment is
to use a threshold or joker" worn. This: provides for
noise immunity a well as an excellent method ox
determining the beginnirlg arid end of a sue etch
utterance, Referring to Figure it a grammar graph 158
employing toe "joker" word, has a self loop 160 at a
node 180, the self loop representing a short silence.
The imaginary sir joker word, represented by an arc 182
between notes 18û an 1~4 has a f iced or constant
likelihood cost per frame associated therewith An
arc 186 repr~s~ntis~g a long silence, leads from node

-40-
184 to a node 188. When silence is the input signal,
that is 3 prior to the utterance being made, the self
loop (short silence) has a good likelihood C05t per
frame and the grammar "staysail at node 180. When
speech begins, the cost per frame for silence
becomes very poor and the mixed cost of the "joker",
along an arc 1~2, is good relative thereto and
provides a path from node 180 to node 184. This is
the beginning of speech. ThPrafter, the transition
from node 184 to 188 denotes the end of speech.
While the grammar graph of Figure 7 works
adequately, improved starting and ending times during
training can be achieved using two joker" words
Referring now to Figure 8, a grammar graph 198 begins
at a nude 200, and so long as silence is received the
grammar stays in a self loop 202 representing a short
silence (which has a low C05t) rather than proceeding
along a arc 204 representing the first joker word.
The fist joker word is assigned a relatively high
likelihood cost. When speech it encountered, the
score for the short silence becomes poor, exceeds the
core for the joker work Marc ~04), and causes the
grammar to traverse arc 204 to a node 206. At node
206, a second joker word 208 having a slightly lower
likelihood cost than the first joker leads away from
the node. When long silence is recognized, the
grammar traverses arc 210. This indicates thy end of
the word. this method gives relatively good noise
immunity because of the added "hysteresis" effect of
the two joker word sand accurately determines the
beginning and end so the incoming isolated utterance
employed during the training. Toe two different
likelihood costs per frame assigned to the different

~229~
"joker" words have the effect of making sure a
word is really detected (by initially favoring
silence); and when, once a word is detected, the
word it favored (my the second joker) to make
sure Salk is really detected to end the word.
The parameters employed in the illustrate
embodiment, in connection wit thy grammar of Figure 8
are:
First Second short Long
JokerJokerSilenc~ Silence
Min. Dwell iamb 1 1 35
Sax. Dwell Tom }01 51 55
likelihood Co~t190Q lB00 -- -
typical)
Referring to Figure pa, the joker word can
: also be employed to provide immunity to sounds such as
coughs during neural speech recognition. In this
respect/ it is a prowled to normal recognition. In
accordance with that aspect of the use of the joker
word, a grammar graph 220 has a starting node 222; and
so long as silence is received, the grammar stays in a
self loop 224 representing short silence which has a
low cost) rather than proceeding either along an arc
226 representing a first joker word or an arc 22~
: I leading to the star of the spoke grammar. When .
speech is encountered, the likelihood cost for the
first joker word is relatively high, and a transition
occurs along the arc 228 into the spoken grammar graph.
. If however non speech such as a cough,
occurs, it is the value of the first joker word which
provides a likelihood cost that causes the grammar to

( )
~.~29~
I
traverse arc 226 to a node 230. The grammar stays at
node 230, held there by a second joker would on a self
arc 232, until a long silence is recognized. The long
silence allows the grammar to traverse an arc 234 back
to starving node 222. In this manner, the machine
returns Jo its quiescent state awaiting a speech input
an effectively "ignores" noise inputs as noted above.
System Structure
Referring now to Figure 4, according to the
preferred embodiment of the invention, the hardware
configuration of Figure 1 employs three identical
circuit boards; that is, the signal processing circuit
board corresponding to circuit 26, a template matching
and dynamic programming circuit board corresponding Jo
circuit 28 and a second template matching and dynamic
programing board corresponding to circuit 30. Each
board has thy same configuration the configuration
being illustrated as circuitry 218 in Figure 4. The
circuitry 218 has three buses an instruction bus 220,
a first internal data Gus Z22, and a second internal
data bus 224, Kink between data buses 222 and
224 are an arithmetic logic unit (ALUM) 226, a fast 8
bit-by-8 Kit multiplier circuit 228 having associated
therewith an accumulator 230 and latching circuits 232
- 25 and 234, a dynamic random access memory TRAM) 236 for
example having 128~000 words of sixteen bit memory
with latching circuits 238, 240, and 242, and a fax
transfer memory 24~ fur example having 2,000 words of
sixteen bit memory and associated latches 246, 24B~ and
250. rightable control tore 252 effects control over
thy operation of the storage and arithmetic elements.
The control store 252 is a random access Emory having

-
~22~
-43-
an output to bus 222 an providing instruction data on
bus 220. The ruble control store, which may be for
example a OK by 64 bit RAM, stores the program in-
striations and is controlled and addressed by a
S micro sequencer 254, for example an AND 2910 micro-
sequencer. A pick machine 256 is provided for clock
timing, dynamic JAM refresh, and other timing
functions as is well known in the art.
This structure employs a double pipeline
method which enables the use of relatively inexpensive
static RAM for the control store 252~
Important to fast operation for implanting
the Lapels transformation to perform likelihood cost
generation an inverting register 260 is provided for
converting the output of the arithmetic logic unit 226
to a twos complement output when necessary. The out-
put of the inverting register is applied to the bus
224.
The operation and programming of the boards
26, 28~ and 30, is con rolled by the particular program
code, which programming enables the board when employed
as a signal processor 26 to provide the acoustic pram-
ethers necessary to perform the template matching and
dynamic programming. Similarly, the programming of
circuitry 28 and 30 enables the acoustic parameters to
be properly processed for generating likelihood cots
and or implementing thy dynamic programming.
In operation, as noted above, the template
matching and dynamic programming circuits 28 and 30
perform tub ~ikeIihood cost calculations on demand as

-
I
--44--
required by the dynamic programming method This can
be accomplished for two reasons. First, the template
pattern data necessary to perform all of the likely-
hood calculations needed by the dynamic programming
portion of a board will be found on that board. (This
is a result ox the highly structured grammar graph and
the restriction that templates are not shared between
words.) Second board, for example board 28, receives
the necessary likelihood scores it lacks to complete
the dynamic programming for the board. The processor
control 32 orchestrates this transfer of information.
The grammar graph referred to above and
described in connection with Figure 5, is stored in
memories 23S and 244. Because it is stored in a
highly structured manner, the data representing one
grammar graph can be replaced with the data repro-
setting a second grammar graph making the equipment
flexible for recognizing different syntax combinations
of words or entirely new speech vocabularies yin which
case training on the new vocabulary words would have
to be done). In the illustrated embodiment, the data
replacement it preferably performed by storing multiple
grammars in memories 236 and 244 and selecting one ox
the grammars under program control. In the illustrated
embodiment, the process control 32 can include a disk
memory for storing additional grammars.
In addition as noted above, the micro-
processor buy for 34 provides tube capability of per-
farming variable rate processing. thus, the dynamic
programming and likelihood score veneration can fall
behind real time somewhat, in the middle ox a speech
utterance where the greatest computational requirement
... . . . I_

~L22~32~
I
occurs, catching up toward the end of the Uranus
where fewer calculations need to be made. In this
manner, the entire system need not respond as noted
above to the peak calculation reknits for real
time speech recognition, but need only respond to the
average requirements in order to effect real time
recognition it the speaker dependent environment
illustrated herein.
Other embodiments of the invention, including
10 additions, su~traG~ions~ deletions, and other modîfica-
Sheehan of thy preferred described embodiment will be
apparent to those skilled in the art and are within the
scope of the following claims.
What is claimed is:

Representative Drawing

Sorry, the representative drawing for patent document number 1229921 was not found.

Administrative Status

2024-08-01:As part of the Next Generation Patents (NGP) transition, the Canadian Patents Database (CPD) now contains a more detailed Event History, which replicates the Event Log of our new back-office solution.

Please note that "Inactive:" events refers to events no longer in use in our new back-office solution.

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer , as well as the definitions for Patent , Event History , Maintenance Fee  and Payment History  should be consulted.

Event History

Description Date
Inactive: IPC deactivated 2016-01-16
Inactive: IPC deactivated 2016-01-16
Inactive: IPC assigned 2015-11-26
Inactive: First IPC assigned 2015-11-26
Inactive: IPC assigned 2015-11-26
Inactive: IPC assigned 2015-11-26
Inactive: Expired (old Act Patent) latest possible expiry date 2005-01-17
Grant by Issuance 1987-12-01

Abandonment History

There is no abandonment history.

Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
EXXON RESEARCH AND ENGINEERING COMPANY
Past Owners on Record
CHIN-HUI LEE
JOHN W. KLOVSTAD
KALYAN GANESAN
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents

To view selected files, please enter reCAPTCHA code :



To view images, click a link in the Document Description column (Temporarily unavailable). To download the documents, select one or more checkboxes in the first column and then click the "Download Selected in PDF format (Zip Archive)" or the "Download Selected as Single PDF" button.

List of published and non-published patent-specific documents on the CPD .

If you have any difficulty accessing content, you can call the Client Service Centre at 1-866-997-1936 or send them an e-mail at CIPO Client Service Centre.


Document
Description 
Date
(yyyy-mm-dd) 
Number of pages   Size of Image (KB) 
Drawings 1993-07-27 4 139
Abstract 1993-07-27 1 31
Cover Page 1993-07-27 1 16
Claims 1993-07-27 3 107
Descriptions 1993-07-27 45 1,747