Patent 1229922 Summary

Third-party information liability

Some of the information on this Web page has been provided by external sources. The Government of Canada is not responsible for the accuracy, reliability or currency of the information supplied by external sources. Users wishing to rely upon this information should consult directly with the source of the information. Content provided by external sources is not subject to official languages, privacy and accessibility requirements.

Claims and Abstract availability

Any discrepancies in the text and image of the Claims and Abstract are due to differing posting times. Text of the Claims and Abstract are posted:

  • At the time the application is open to public inspection;
  • At the time of issue of the patent (grant).
(12) Patent: (11) CA 1229922
(21) Application Number: 472292
(54) English Title: SPEECH RECOGNITION TRAINING METHOD
(54) French Title: METHODE DE FORMATION EN RECONNAISSANCE DE LA PAROLE
Status: Expired
Bibliographic Data
(52) Canadian Patent Classification (CPC):
  • 354/50
(51) International Patent Classification (IPC):
  • G10L 15/06 (2013.01)
  • G10L 25/78 (2013.01)
  • G10L 15/10 (2006.01)
(72) Inventors :
  • BAKER, JAMES K. (United States of America)
  • KLOVSTAD, JOHN W. (United States of America)
  • GANESAN, KALYAN (United States of America)
  • LEE, CHIN-HUI (United States of America)
(73) Owners :
  • EXXON RESEARCH AND ENGINEERING COMPANY (United States of America)
(71) Applicants :
(74) Agent: RICHES, MCKENZIE & HERBERT LLP
(74) Associate agent:
(45) Issued: 1987-12-01
(22) Filed Date: 1985-01-17
Availability of licence: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): No

(30) Application Priority Data: None

Abstracts

English Abstract


Abstract of the Disclosure
A speech recognition method and apparatus
employ a speech processing circuitry for repetitively
deriving from a speech input, at a frame repetition
rate, a plurality of acoustic parameters. The
acoustic parameters represent the speech input signal
for a frame time. A plurality of template matching
and cost processing circuitries are connected to a
system bus, along with the speech processing
circuitry, for determining, or identifying, the speech
units in the input speech, by comparing the acoustic
parameters with stored template patterns. The
apparatus can be expanded by adding more template
matching and cost processing circuitry to the bus
thereby increasing the speech recognition capacity of
the apparatus. Template pattern generation is
advantageously aided by using a "joker" word to
specify the time boundaries of utterances spoken in
isolation.


Claims

Note: Claims are shown in the official language in which they were submitted.




1. In a speech recognition apparatus where-
in speech units are each characterized by a sequence
of template patterns, and having
means for processing a speech input sig-
nal for repetitively deriving therefrom, at a frame
repetition rate, a plurality of speech recognition
acoustic parameters, and
means responsive to said acoustic
parameters
for generating likelihood costs
between said acoustic parameters and said speech tem-
plate patterns, and
for processing said likelihood
costs for determining the speech units in said speech
input signal,
a method for generating said template
patterns comprising the steps of
finding the beginning and end of an
input speech unit surrounded by silence for which tem-
plate patterns are to be generated, and
generating in accordance with a
known procedure, template patterns representing said
speech unit,
said finding step comprising
modelling silence as a template
pattern,
for each frame, comparing said si-
lence template pattern likelihood cost with a fixed
reference threshold value, and
declaring the beginning of said
speech unit when the score for the silence template
pattern crosses the threshold value.


2. The speech recognition template pattern
generating method of claim 1 further comprising the
step of



declaring the end of said speech unit
when the score for the silence template improves suf-
ficiently to cross a second threshold value.

3. The speech recognition template pattern
generating method of claim 2 wherein said second
threshold value is less than said first threshold
value.


4. In a speech recognition apparatus where-
in speech units are each characterized by a sequence
of template patterns, and having
means for processing a speech input sig-
nal for repetitively deriving therefrom, at a frame
repetition rate, a plurality of speech recognition
acoustic parameters, and
means responsive to said acoustic
parameters
for generating likelihood costs
between said acoustic parameters and said speech tem-
plate patterns, and
for processing said likelihood
costs for determining the speech units in said speech
input signal,
a method for generating said template
patterns comprising the steps of
finding the beginning and end of an
input speech unit surrounded by silence for which tem-
plate patterns are to be generated, and
generating in accordance with a
known procedure, template patterns representing said
speech unit,
said finding step comprising
modelling silence as a template
pattern,




for each frame, comparing said silence template
pattern likelihood cost with a fixed reference threshold value,
and
declaring the beginning of said speech unit when the
score for the silence template pattern crosses the threshold
value,
declaring the end of said speech unit when the score
for the silence template improves sufficiently to cross a second
threshold value, and
wherein said second threshold value is less than said
first threshold value.

5. In a speech recognition apparatus wherein speech units
are each characterized by a sequence of template patterns, and
having,
means for processing a speech input signal for
repetitively deriving therefrom, at a frame repetition rate, a
plurality of speech recognition acoustic parameters, and
means responsive to said acoustic parameters
for generating likelihood costs between said acoustic
parameters and said speech template patterns, and
for processing said likelihood costs for determining
the speech units in said speech input signal,
a method for generating said template patterns com-
prising the steps of
finding, using a dynamic programming and a grammar
graph, the beginning and end of an input speech unit surrounded
by silence for which template patterns are to be generated, and
generating in accordance with a known procedure,
template patterns representing said speech unit,
said finding step comprising
modelling silence as a template pattern,
associating from a beginning node of said grammar
graph a first arc having a fixed reference threshold value, and
associating with said beginning node a silence self
loop, and following said dynamic programming for determining
the beginning of said speech unit.





6. The speech recognition template pattern generating
method of claim 5 further comprising the step of
associating with an intermediate node to which said
first arc leads, a second arc having a second fixed reference
threshold value, and
associating with a second intermediate node to which
said second arc leads, a third arc corresponding to the likeli-
hood score of silence, and
determining when said score at a node to which said
third arc leads matches a predetermined condition.

7. The speech recognition template pattern generating
method of claim 6 wherein said second threshold value is less
than said first threshold value.


Description

Note: Descriptions are shown in the official language in which they were submitted.


SPEECH RECOGNITION TRAINING METHOD




Background of the Invention
The invention relates generally to the field of speech recognition and in particular to the recognition of speech elements in continuous speech.

The need for speech recognition equipment which is reliable and reasonably priced is well documented in the technical literature. Speech recognition equipment generally falls in two main categories. One category is speaker independent equipment wherein the speech recognition apparatus is designed to recognize elements of speech from any person. However, speaker independent systems can be quite limited with regard to features other than the "speaker independence", for example, the number of words in the recognition vocabulary. Also, typically, five to ten percent of the population will not be recognized by such systems. The other category, speaker dependent speech recognition, relates to speech recognition apparatus which are substantially trained to recognize speech elements of a limited class, and in particular the class consisting

of one person. Within each category, the speech recognition apparatus can be directed to the recognition of either continuous speech, that is, speech wherein the boundaries of the speech elements are not defined, or to isolated speech, that is, speech in which the boundaries of the speech elements are a priori known. An important difference between continuous and isolated speech recognition is that in continuous speech, the equipment must make complex decisions regarding the beginnings and ends of the speech elements being received. For isolated speech, as noted above, the incoming audio signal is isolated or bounded by either a given protocol or other external means which makes the boundary decisions relatively simple.

There exist today many commercial systems for recognizing speech. These systems operate in either a speaker independent environment (as exemplified, for example, by U.S. Patents 4,038,503; 4,227,176; 4,228,498; and 4,241,329 assigned to the assignee of this invention) or in the speaker dependent environment. In addition, the commercially available equipment variously operates in either an isolated speech or a continuous speech environment.

The commercially available equipment, however, is expensive when high recognition performance is required. This is often a result of the best equipment being built for the most difficult problem, that is, speaker independent, continuous speech recognition. Consequently, many of the otherwise available applications to which speech recognition equipment could be adapted have not been considered because of the price/performance relationship of the equipment.




Furthermore, the commercially available equipment cannot easily be expanded to provide added capability at a later date, and/or does not have the required accuracy or speed when operating outside of the laboratory environment.

Primary objects of the present invention are therefore an accurate, reliable, reasonably priced, continuous speech recognition method and apparatus which can operate outside of the laboratory environment and which enable the user to quickly and easily establish an operating relationship therewith. Other objects of the invention are a method and apparatus generally directed to speaker dependent, continuous speech recognition, and which have a low false alarm rate, high structural uniformity, easy training to a speaker, and real time operation.

Summary of the Invention
The invention relates to a speech recognition apparatus and method in which speech units, for example words, are characterized by a sequence of template patterns. The speech apparatus includes circuitry for processing a speech input signal for repetitively deriving therefrom, at a frame repetition rate, a plurality of speech recognition acoustic parameters. The acoustic parameters thus represent the speech input signal for a frame time. The apparatus further has circuitry responsive to the acoustic parameters for generating likelihood costs between the acoustic parameters and the speech template patterns. The circuitry is further adapted for processing the likelihood costs for determining or recognizing the speech units of the speech input signal.

A method for generating the template patterns employed during speech recognition features the steps of finding the beginning and end of an input speech unit which is surrounded by silence, and generating, in accordance with a known procedure, the template patterns representing the speech unit. The finding step employs modeling silence as a template pattern, and, for each frame, comparing the silence template pattern cost with a fixed reference threshold value. The beginning of the speech unit is declared when the score for the silence template crosses, and is worse than, the threshold value. This method can further feature a second threshold value for declaring the end of a speech unit. When the score for the silence template improves so that it is better than the second threshold, the end of the speech unit is declared. Preferably the second threshold value is less than the first threshold value.
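A minimal sketch of this two-threshold endpoint decision is given below, assuming a stream of per-frame silence-template costs in which a higher cost means a worse match to silence. The function name and threshold arguments are illustrative assumptions, not the patented implementation.

```python
def find_word_boundaries(silence_costs, begin_threshold, end_threshold):
    """Declare the beginning and end of an isolated speech unit from the
    per-frame likelihood cost of a silence template (hypothetical sketch;
    end_threshold is less than begin_threshold, as the text prefers)."""
    begin = end = None
    for frame, cost in enumerate(silence_costs):
        if begin is None and cost > begin_threshold:
            begin = frame      # silence score crossed, and is worse than,
                               # the first threshold: speech has begun
        elif begin is not None and cost < end_threshold:
            end = frame        # silence score improved past the second
            break              # threshold: speech has ended
    return begin, end
```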

Brief Description of the Drawings

Figure 1 is a schematic block diagram showing an overview of the speech recognition apparatus;

Figure 2 is a schematic flow diagram of the signal processing portion of the illustrated embodiment of the invention;

Figure 3 is a detailed flow diagram of the dynamic programming and template matching portion according to the invention;

Figure 4 is a more detailed schematic block diagram of the uniform processing circuitry portion according to the invention;

Figure 5 is a schematic representation of a dynamic programming grammar graph according to the invention;

Figure 6 is a schematic representation of a dynamic programming word graph according to the invention;

Figure 7 is a schematic representation of a grammar using a single element joker word according to the invention;

Figure 8 is a schematic representation of a grammar using a double joker word according to the invention;

Figure 8a is a schematic representation of a grammar using a double joker word as a prelude to speech recognition;

Figure 9 is a diagrammatic block diagram of a prior art apparatus; and

Figure 10 is a diagrammatic lattice representation of the dynamic programming approach.

Description of a Preferred Embodiment
An Overview

Referring to Figure 1, a speech recognition apparatus 10 according to the invention receives an audio input over a line 12. The audio input is buffered and filtered by a preamplifier 14. The preamplifier 14 has an anti-aliasing filter and provides an output voltage over a line 16 which has the proper voltage values (that is, is normalized) for an analog-to-digital converter 18. In the illustrated embodiment, the analog-to-digital converter operates at a rate of 16,000 twelve-bit conversions per second and provides the twelve bit output on lines 20. The twelve bit output of the analog-to-digital converter is directed to a buffer memory 22. The buffer 22 has a capacity for storing up to 320 samples or twenty milliseconds of speech. The minimum requirement is that buffer 22 be able to store slightly more than 160 samples.

The buffer 22 is connected to an internal data bus 24. The bus 24 acts as the primary data communications bus for the apparatus. The bus 24 thus interconnects buffer 22 with a signal processing circuitry 26, a first template matching and dynamic programming circuitry 28, a second template matching and dynamic programming circuitry 30, a process controlling circuitry 32, an acoustic parameter buffer memory 34, and a trace back buffer 36.

The recognition process is controlled by the process control circuitry 32 which incorporates, for example, a commercial microprocessor as the control element. The process control circuitry uses the memory 34 to enable the apparatus to operate with a variable processing rate. This means that the entire apparatus need not meet peak load demands in real time but only needs to operate in real time with respect to the average processing load. The hardware savings involved by employing this process configuration are substantial and are discussed in greater detail below.

Each of the circuitries 26, 28, and 30, in the illustrated embodiment, is structurally identical. They are modified by the software programs placed therein to perform either the audio signal processing of circuitry 26 or the template matching and dynamic programming of circuitries 28 and 30. This is discussed in more detail below. Each of the circuitries 26, 28, and 30 has therein a small 2,000 word memory 26a, 28a, and 30a respectively (sixteen bit words). These memories act as small fast buffers providing sufficient storage for continuous processing in the circuitries 26, 28, and 30. The apparatus 10 structure can be easily expanded along bus 24 by adding additional template matching and dynamic programming circuitries (such as circuitries 28 or 30) to process a larger recognition vocabulary. This expansion capability is a direct result of the chosen architecture, which dictates that template matching and dynamic programming be executed on the same machine board, and in this particular embodiment by the use of identical circuitries for circuitries 28 and 30.

In the illustrated embodiment, each of the circuitries 26, 28, and 30 occupies one complete board of the assembly. While it would have been desirable to combine two of the data processing boards, the physical size of the combined board would not fit into the "rack", and hence, in today's semiconductor technology, it was not feasible to combine the boards. The architecture which has been developed, however, not only allows new boards to be added along bus 24 as described above, but reduces the data communications log jam that typically occurs in prior apparatus using separate template matching and dynamic programming circuitries (see for example Figure 9 which shows a signal processing circuit 46 feeding a template matching circuit 48 and the template matching circuit 48 feeding a dynamic programming circuit 50). As a result of using separate template matching and dynamic programming circuitries, bandwidth problems invariably occur at the intrinsically high bandwidth connection 52 and must be solved. In the illustrated embodiment of the invention, the apparatus structure allows for parallel processing on demand as described below, reducing the bandwidth requirements along the bus and correspondingly decreasing the cost of the apparatus.


The speech recognition apparatus and method implemented by the illustrated embodiment processes the audio input over line 12, in the signal processing circuitry 26, to provide a set of acoustic parameters characterizing the input speech. It is these parameters which are stored in memory 34. The set of acoustic parameters can be thought of, in the illustrated embodiment, as a vector of sixteen eight-bit numbers or components. A new vector of acoustic parameters is generated each frame time, a frame time in the illustrated embodiment being ten milliseconds.

The acoustic parameters are called for, on demand, by the first and second template matching and dynamic programming circuitries. These circuitries, in general operation, compare each vector with prestored reference template patterns, only as needed, and generate likelihood statistics or costs representative of the degree of similarity. The previously stored reference template patterns characterize the speech elements to be recognized using the same processing methods employed in generating the acoustic parameters. Each speech element, for example a word, is characterized by a sequence of template patterns as described in more detail below. A dynamic programming method is employed to generate a recognition decision, based upon the likelihood statistics or costs, in a manner which allows the apparatus to operate in real time.

The description below first describes how the acoustic parameters are generated. It then details the processing of the acoustic parameters, by circuitries 28 and 30, to obtain a recognition decision. It is important to note and appreciate that circuitries 28 and 30 are more than identical in structure; they operate in parallel in order to distribute the processing load and further enable real time operation.

Audio Signal Processing
Referring now to Figure 2, an acoustical input 100 (over line 12 of Fig. 1) is passed at 101 to an A/D converter (18 in Fig. 1) after the necessary normalization and buffering (shown in Figure 1 by preamplifier 14). The output of the A/D converter, after buffering by memory 22 (Fig. 1), is processed by the signal processing circuitry 26 as follows.

In the signal processing description that follows, the processed values of the data will often be clipped or normalized so that they fit into one eight bit byte. This is done so that a subsequent multiplication and/or accumulation does not produce any overflow beyond sixteen bits, and, with respect to normalization, to make best use of the available dynamic range.

The twelve bit output of the A/D converter is differentiated and clipped at 102. The input is differentiated by taking the negative first differences between successive input values. This occurs at the 16 kHz sampling rate. The differencing procedure reduces the dynamic range of the input waveform and preemphasizes the high frequencies. In the frequency domain, the effect of differentiation is multiplication by frequency which results in a six db per octave boost for the high frequencies. This high frequency preemphasis is desirable because the amplitude of the speech signal decreases as a function of frequency. The differentiated acoustical signal is then clipped so that it fits into one byte.
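As a concrete illustration of this stage, the differencing and clipping might be sketched as follows; the function name and the use of numpy are assumptions made for the example.

```python
import numpy as np

def difference_and_clip(samples):
    """Negative first differences of the 16 kHz samples, clipped to one
    signed byte (a sketch of block 102 of Fig. 2)."""
    d = samples[:-1].astype(np.int32) - samples[1:].astype(np.int32)
    return np.clip(d, -128, 127).astype(np.int8)
```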

The average amplitude, the mean, and the log of the mean squared amplitude of the differentiated and clipped output are then determined at 104, 105 for, in the illustrated embodiment, a "window" having 320 samples or twenty milliseconds of speech. The log used here is:

$$8\log_2(\text{amplitude}) - 128 \qquad \text{(Eq. 1)}$$

The result is then clipped to fit into a single byte.

While the "window used here is twenty mill
liseconds in duration, it is important to recognize that, according to the invention, the signal processing
circuitry to intended to generate a new set of acoustic
parameters was described below) every ten milliseconds.
Successive windows therefore overlap, by ten Millie
seconds in the illustrated embodiment.

Next, the differentiated and clipped data of the twenty millisecond window is normalized at 106 by subtracting therefrom the mean amplitude over the "window". This is in effect equivalent to subtracting the zero frequency component, or DC level, of the signal. The normalized data is clipped again to fit within a single byte.





The normalized output from block 106 is then windowed at 108. Windowing is multiplication of the input array by a vector of window constants at 109 which has the effect of attenuating the data at both ends of the window. In the frequency domain, the height of the side lobes is thereby reduced at the cost of increasing the width of the center lobe, thus smoothing the resulting spectral estimate. While various types of windowing functions exist, each producing slightly different tradeoffs between side lobe height and center lobe width, the window chosen in the illustrated embodiment, and which has been found to produce statistically better results, is a sinc-Kaiser window which consists of multiplying the sinc function (sin x)/x by a Kaiser function.

The Kaiser window is useful since it parameterizes the trade-off between side lobe height and center lobe width. Multiplying by the sinc function gives a bandwidth at each frequency of the Fourier transform. In the illustrated embodiment a bandwidth of 355 hertz is used. In the Kaiser function window, the beta parameter, β, is set at 5.1.


The Kaiser function, which is described, for example, in "Digital Filters," chapter 7 of System Analysis by Digital Computer, Kuo and Kaiser, John Wiley & Sons, New York, 1966, is given by

$$w(n) = \frac{I_0\!\left(\beta\sqrt{1 - \left(\frac{2n}{N-1} - 1\right)^2}\right)}{I_0(\beta)} \qquad \text{(Eq. 2)}$$

where $I_0$ is a modified zeroth order Bessel function of the first kind:

$$I_0(z) = \frac{1}{\pi}\int_0^{\pi} e^{z\cos\theta}\,d\theta \qquad \text{(Eq. 2a)}$$

The sinc function for the parameters of the illustrated embodiment is:

$$\frac{\sin\big(a\,(n - (N-1)/2)\big)}{a\,(n - (N-1)/2)}, \quad n = 0, 1, \ldots, N-1 \qquad \text{(Eq. 3)}$$

with N = 320 points in the window, where

$$a = \frac{2\pi(355/2)}{16000} \approx 0.0697$$
The waveform is windowed after normalization (rather than before normalization) since otherwise the mean would introduce an extra rectangular signal which would increase the side lobes. The windowed waveform is block normalized so that its samples fit in thirteen bits. This is so that accumulation performed during folding, as described below, does not overflow sixteen bits.
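The window construction described above can be sketched as follows. The beta value and bandwidth are taken from the text as read here and should be treated as assumptions; numpy's np.i0 supplies the modified zeroth order Bessel function of Eq. 2a.

```python
import numpy as np

N = 320                            # 20 ms window at 16 kHz
BETA = 5.1                         # Kaiser beta parameter (as read above)
a = 2 * np.pi * (355 / 2) / 16000  # ~0.0697 for a 355 Hz bandwidth (Eq. 3)

n = np.arange(N)
t = a * (n - (N - 1) / 2)
sinc = np.where(t == 0, 1.0, np.sin(t) / np.where(t == 0, 1.0, t))  # Eq. 3
kaiser = np.i0(BETA * np.sqrt(1 - (2 * n / (N - 1) - 1) ** 2)) / np.i0(BETA)
window = sinc * kaiser             # multiply the sinc by the Kaiser function
```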

The discrete Fourier transform of the windowed data is now performed at 112. While there are many ways of efficiently performing Fourier transforms, in order to reduce the number of multiplications and hence the time to effect the Fourier transform computation, a folding technique at 113, making use of the symmetries of the sine and cosine by converting the data vector into four smaller vectors, is performed. Since the sampling of values in the frequency domain is done at multiples of a base frequency, each of the resulting vectors will have a length of 1/4 the period of the base or fundamental frequency, also called the base period.

In the illustrated embodiment, the base frequency, from a 16 kilohertz sampling rate and a 20 millisecond window, is chosen to be 125 hertz. (The corresponding base period is 128 samples.) This represents the minimum chosen spacing between spectral frequency samples. As noted above, a 20 millisecond window encompasses two 10 millisecond frames of the incoming acoustic (and digitized) signal. Successive windows thus have an overlapping character and each sample contributes to two windows.

According to the folding technique, frequencies are divided into odd multiples and even multiples of the base frequency. The real (multiplication by the cosine) and the imaginary (multiplication by the sine) components of the transform will each use one of the resulting folded vectors for both classes of frequency.

The folding operation is performed in three steps. First, the samples which are offset by the base period are grouped together. In the illustrated embodiment, using a base frequency of 125 hertz at a sampling rate of 16 kilohertz, this results in 128


points. This fold is a direct consequence of the expression for the Fourier transform:

$$F(nf) = \sum_k x(k)\exp(-2\pi jnfk) \qquad \text{(Eq. 4)}$$

where f is the base frequency (125 Hz) and the sum is extended from k = 0 through the number of the samples in the window (less 1). Since

$$\exp(-j\theta) = \exp(-j(\theta + 2\pi)) \qquad \text{(Eq. 5)}$$

the transformation can be rewritten as:

$$F(nf) = \sum_{k=0}^{4Q-1} x_1(k)\exp(-2\pi jnfk) \qquad \text{(Eq. 6)}$$

where Q is equal to 1/4 of the base period and x₁(k) is the sum of x(k) + x(k+4Q) + x(k+8Q).

The second fold operation is performed by rewriting the last expression (Equation 6) as:

$$F(nf) = \sum_{k=0}^{4Q-1} x_1(k)\big(\cos(2\pi nfk) - j\sin(2\pi nfk)\big) \qquad \text{(Eq. 7)}$$

We can then use the symmetries of the sine and cosine functions, sin(a) = −sin(2π − a) and cos(a) = cos(2π − a); the transform of Equation 7 can be rewritten as:

$$F(nf) = \sum_{k=0}^{2Q-1} \big(x_{2c}(k)\cos(2\pi nfk) - j\,x_{2s}(k)\sin(2\pi nfk)\big) \qquad \text{(Eq. 8)}$$

where

x₂c(k) = x₁(k) + x₁(4Q−1−k) and
x₂s(k) = x₁(k) − x₁(4Q−1−k).

The sum extends from k = 0 to k = 2Q−1. Thus there are 64 terms where the sampling rate is 16 kilohertz and the base frequency is 125 hertz.




The third step of the procedure resolves the instance of odd multiples of the base frequency. The symmetries sin(a) = sin(π − a) and cos(a) = −cos(π − a) allow the transformation of Equation 8 to be rewritten as:

$$F(nf) = \sum_{k=0}^{Q-1} \big(x_{3CO}(k)\cos(2\pi nfk) - j\,x_{3SO}(k)\sin(2\pi nfk)\big) \qquad \text{(Eq. 9)}$$

where

x₃CO(k) = x₂c(k) − x₂c(2Q−1−k) and
x₃SO(k) = x₂s(k) + x₂s(2Q−1−k).

For even multiples of the base frequency the equations are:

$$F(nf) = \sum_{k=0}^{Q-1} \big(x_{3CE}(k)\cos(2\pi nfk) - j\,x_{3SE}(k)\sin(2\pi nfk)\big) \qquad \text{(Eq. 10)}$$

where

x₃CE(k) = x₂c(k) + x₂c(2Q−1−k) and
x₃SE(k) = x₂s(k) − x₂s(2Q−1−k).

This procedure uses the equalities sin(a) = −sin(2π − a) and cos(a) = cos(2π − a). The sum, after the third procedure or third fold, now extends from zero to k = Q−1. That is, 32 terms for a sampling rate of 16 kilohertz and a base frequency of 125 hertz. After the three folds, the vectors are block normalized to sixteen bits.
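To make the folding concrete, the sketch below applies only the first fold (Eq. 6), which already shrinks the transform to one base period; the second and third folds exploit the sine and cosine symmetries of Equations 7 through 10 in the same spirit. The names and structure are illustrative assumptions, not the patented circuitry.

```python
import numpy as np

FS, BASE_HZ, N = 16000, 125, 320
PERIOD = FS // BASE_HZ        # base period: 128 samples (4Q, with Q = 32)

def folded_dft(window, freqs_hz):
    """Real/imaginary spectral terms at multiples of the base frequency,
    computed after the first fold."""
    x1 = np.zeros(PERIOD)
    for k, v in enumerate(window):      # x1(k) = x(k) + x(k+4Q) + x(k+8Q)
        x1[k % PERIOD] += v
    k = np.arange(PERIOD)
    results = []
    for f in freqs_hz:
        n = f // BASE_HZ                # which multiple of 125 Hz
        angle = 2 * np.pi * n * k / PERIOD
        results.append((np.dot(x1, np.cos(angle)),
                        -np.dot(x1, np.sin(angle))))
    return results
```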

At this point the discrete Fourier transform is completed by multiplying the data vectors from the folding procedure by a matrix of sines and cosines. By calculating the multiple of the base frequency, modulo 2, the set of folds which are needed to be used can be determined. The resulting vectors are block normalized to fit within a single byte. The result of the Fourier analysis is two vectors of signed integers, the integers ranging from −128 to 127 (that is, one byte), one of the vectors containing the real terms of the Fourier analysis and the other containing the imaginary terms of the analysis. The length of the vectors is equal to the number of frequencies employed during the recognition process. In the illustrated embodiment the number of frequencies is thirty-one and they are:

 250   1250   2250   3250
 375   1375   2375   3500
 500   1500   2500   3750
 625   1625   2625   4000
 750   1750   2750   4500
 875   1875   2875   5000
1000   2000   3000   5500
1125   2125   3125




The next step, at 114, is to determine the sum of the squares of the real and imaginary parts of the spectrum at each multiple of the base frequency. The result is divided by two, that is, shifted down one bit, so that the square root which is computed at 116 will fit into one byte.

After the square root is taken at 116, the resulting spectrum is transformed, at 118, according to the equation:

$$f(x) = \frac{128\,(x - \text{mean})}{x + \text{mean}} \qquad \text{(Eq. 11)}$$

This function, often called the "quasi-log" and described in more detail in U.S. Patent No. 4,038,505, enhances the speech recognition properties of the data by redistributing the dynamic range of the signal. The "mean" is the average value of the array of spectral values.
The result of the quasi-log is a vector having, in the illustrated embodiment, thirty-one acoustical parameters. Added to this vector, at 120, is the mean amplitude value of the input signal calculated at 104 and 105. This amplitude value and the 31 element vector resulting from the quasi-log step at 118 are applied to a principal component transformation at 122. The principal component transformation reduces the number of elements in the vector array to provide a smaller number of acoustical parameters to compare against the speech representing reference template patterns prestored in the apparatus. This reduces both computational cost and memory cost.

The principal component transformation, at 122, is performed in three parts. In the first step, a speaker independent mean for each element of the vector is subtracted from the respective element in the vector. Second, the result of the first step is multiplied by a speaker independent matrix which is formed as described below. Finally, each element of the resulting vector from the second step is individually shifted by an element dependent amount and then is clipped to fit within an eight bit byte (see 123). This essentially means that the distribution of each component of the vector has been normalized so that

the byte will contain three standard deviations around the mean of the component. The output of the principal component analysis, an array of bytes having a vector length equal to a selected number of components, is the acoustic parameter output of the signal processing section 26 of the apparatus of Figure 1 and is applied to the bus 24 for further processing as described in more detail below.
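In outline, the three-part transformation might be coded as below; the argument names are assumptions, and si_mean, si_matrix, and shifts stand for the speaker independent mean, speaker independent matrix, and element dependent shift amounts described above.

```python
import numpy as np

def principal_component_transform(params, si_mean, si_matrix, shifts):
    """Map the 32-element vector (31 quasi-log values plus amplitude) to a
    shorter acoustic parameter vector of signed bytes (sketch of step 122).
    shifts holds the element-dependent right-shift amounts."""
    y = si_matrix @ (params.astype(np.int32) - si_mean)  # parts one and two
    y = y >> shifts                        # element-dependent rescaling
    return np.clip(y, -128, 127).astype(np.int8)  # clip to one byte (123)
```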

The principal component analysis is employed to reduce the number of acoustic parameters used in the pattern recognition which follows. Principal component analysis is a common element of many pattern recognition systems and is described in terms of an eigenvector analysis, for example in U.S. Patent No. 4,227,176, assigned to the assignee of this invention. The concept of the analysis is to generate a set of principal components which are normalized linear combinations of the original parameters and where the recombination is effected to provide components with maximum information while maintaining independence from each other. Thus typically, the first principal component is the normalized linear combination of parameters with the highest variance. The concept is that the direction containing the most variation also contains the most information as to which class of speech a vector belongs.

Classically, the vectors containing the linear combinations under consideration are constrained to be of unit length. This would be the best procedure if all recombined parameters had the same variation within the defined speech classes, but unfortunately they do not. The prior art analysis method is therefore

modified to normalize all linear combinations of parameters to have the same average, within class, variance, and the first principal component is chosen to be that normalized linear combination with the highest across class variance.

The principal component analysis is derived as follows. Let M' represent the transpose of a matrix M. Let T be the covariance matrix of some set of P original parameters. Let W be the average within class covariance matrix. Let V be a P dimensional vector of coefficients, and let X be a P dimensional vector of acoustic parameters. For simplicity of derivation, assume that the original parameter vector X has a zero mean. In practice, in the illustrated embodiment, the assumption is implemented by subtracting from the original parameters their respective mean values. The variance of a linear combination of parameters V'X is

$$E\big[(V'X)(V'X)'\big] = V'(XX')V = V'TV \qquad \text{(Eq. 12)}$$

since T is defined as the expectation of XX'. Similarly, if Wᵢ is the covariance of the ith class of parameters, then V'WᵢV is the covariance of the parameter V'X in the ith class. The average within class variance is then

$$\frac{1}{N}\sum_i V'W_iV \qquad \text{(Eq. 13)}$$

where N is the number of classes. By the distributive property of matrix multiplication, this is just equal to V'WV. This expression weights all classes equally,




independent of the number of samples contained in each class. Actually, in the illustrated embodiment, all samples were weighted equally on the assumption that it is more important to discriminate between classes that occur frequently.

Next, V'TV is maximized under the constraint:

$$V'WV = 1 \qquad \text{(Eq. 14)}$$

This problem can be solved by using Lagrange multipliers. Let the one Lagrange multiplier be denoted by y. We then want to solve (Eq. 14) and (Eq. 17) (below):

$$g = V'TV - y(V'WV) \qquad \text{(Eq. 15)}$$
$$0 = dg/dV = 2TV - 2yWV \qquad \text{(Eq. 16)}$$
$$TV = yWV \qquad \text{(Eq. 17)}$$

This equation (17) is just the general eigenvector problem and is solved by P eigenvectors with real eigenvalues, due to the symmetry of T and W.

The solution which maximizes the original criterion is given by the eigenvector V with the largest eigenvalue y. This can be seen by observing that the variance of V'X is given by

$$V'TV = yV'WV = y \cdot 1 = y \qquad \text{(Eq. 18)}$$

Let the solution to (Eq. 17) with the largest eigenvalue be V₁, and the corresponding eigenvalue be y₁. For the next parameter, solve for the linear combination of parameters V'X which maximizes V'TV, which has unity within class variance (V'WV = 1), and which is uncorrelated with V₁'X. The condition that

V'X is uncorrelated with V₁'X allows the analysis to find the linear combination of parameters which has maximum across-class variance compared to within class variance, but which does not include information already available from the first component. By the definition of uncorrelated, we have:

$$E\big[(V'X)(V_1'X)'\big] = V'(XX')V_1 = V'TV_1 = 0 \qquad \text{(Eq. 19)}$$

Let y and z be Lagrange multipliers, and solve (Equations 14, 19 and 21):

$$g = V'TV - y(V'WV) - z(2V'TV_1) \qquad \text{(Eq. 20)}$$
$$0 = dg/dV = 2TV - 2yWV - 2zTV_1 \qquad \text{(Eq. 21)}$$

Multiplying through on the left by V₁' and dividing by 2:

$$0 = V_1'TV - yV_1'WV - zV_1'TV_1 \qquad \text{(Eq. 22)}$$

Substituting V₁'TV = 0 as a constraint, and WV₁ = TV₁/y₁ from (Eq. 17) in these relations, and multiplying through by −1:

$$0 = (y/y_1)(V_1'TV) + z(V_1'TV_1) \qquad \text{(Eq. 23)}$$
$$0 = z(V_1'TV_1) \qquad \text{(Eq. 24)}$$
$$z = 0 \qquad \text{(Eq. 25)}$$

Therefore, given V₁, the following three equations can be solved:

$$TV = yWV \qquad \text{(Eq. 26)}$$
$$V'WV = 1 \qquad \text{(Eq. 27)}$$
$$V'TV_1 = 0 \qquad \text{(Eq. 28)}$$

Any vector which satisfies (Eq. 26) can be turned into a vector which satisfies both (Eq. 26 and 27) by dividing by the quantity (V'WV)^(1/2).

Consider any two vectors, Q and R, satisfying (Eq. 26), with corresponding eigenvalues q and r. Then

$$rQ'WR = Q'TR = R'TQ = qR'WQ = qQ'WR \qquad \text{(Eq. 29)}$$
$$(r - q)\,Q'WR = 0 \qquad \text{(Eq. 30)}$$

If r is not equal to q, then

$$0 = Q'WR = Q'TR/r \qquad \text{(Eq. 31)}$$
$$Q'TR = 0 \qquad \text{(Eq. 32)}$$

If (Eq. 26) is satisfied by two vectors with different nonzero eigenvalues, then those two vectors also satisfy (Eq. 28), and are therefore uncorrelated.

The second component is thus chosen to be the vector V₂ which satisfies (Eq. 26 and 27) and which has the second largest eigenvalue, assuming the two largest are distinct. Correspondingly, the nth component Vₙ can be made uncorrelated with V₁ through Vₙ₋₁ and the same procedure can be employed to show that Vₙ must only satisfy (Eq. 26 and 27), so long as there are n distinct nonzero eigenvalues.

In this manner, assuming that the characteristic polynomial corresponding to (Eq. 26) has N distinct nonzero roots, a sequence of N linear combinations of parameters, each one maximizing the ratio (V'TV)/(V'WV) while constrained to be uncorrelated with the previous linear combinations in the sequence, can be determined. This sequence comprises the generated principal components.

In the method described above, the linear combinations of parameters that maximize V'TV while holding V'WV constant are found. Since, as can be easily shown, T = B + W, where B is the average between-class variance, the same result can be obtained by maximizing V'BV while holding V'WV constant. Since B can be expressed as the sum of terms involving differences between pattern means, intuitively, by maximizing V'BV the pattern means are being moved apart from each other. It may be that in speech recognition it is more important to distinguish between some patterns than between others. If the differences between pattern means in the formulation of B are weighted by different amounts, the resultant principal components will be biased to differentiate more between some patterns than others. One instance for assigning different weights to different pattern pairs is when data and patterns from different speakers are used in the principal components analysis. In such a case it is not worthwhile to try to separate patterns from different speakers, and all such pattern pairs should receive the weight of zero.

The creation of the principal component matrix takes place prior to the real time speech recognition process. Typically, a large data base is required to obtain a good estimate of the matrix.
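For reference, the offline estimation reduces to the generalized symmetric eigenproblem of Eq. 26, which standard libraries solve directly. The sketch below uses scipy (an assumption; the patent specifies no such tool) and returns the leading components scaled so that V'WV = 1 (Eq. 27).

```python
import numpy as np
from scipy.linalg import eigh

def discriminant_components(T, W, n_components):
    """Solve T V = y W V (Eq. 26) for the eigenvectors with the largest
    eigenvalues. T: total covariance; W: average within-class covariance."""
    eigenvalues, eigenvectors = eigh(T, W)   # returned in ascending order
    order = np.argsort(eigenvalues)[::-1]    # largest eigenvalues first
    return eigenvectors[:, order[:n_components]]  # scaled so V' W V = 1
```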

The output of the principal component transformation is the acoustic parameter vector. A new vector or set of acoustic parameters is made available each frame time, that is, each ten milliseconds in the illustrated embodiment. The acoustic parameters are available on bus 24 from the signal processing circuitry 26. Since each frame time has a window representing twenty milliseconds of input audio, there is, as noted above, an overlap in the information represented by the acoustic parameter data. The acoustic parameter data is stored in the buffer memory 34, as noted above, and under the control of the process controller 32; this allows a variable rate data processing by circuitries 28 and 30. This is important in order to allow the entire apparatus to meet the average real time requirements of analyzing the incoming speech, but not to require it to maintain real time processing throughout each speech element on a frame by frame basis.

On a frame by frame basis, the maximum template matching and dynamic programming data processing requirements generally occur toward the middle of a speech unit, for example a word. Correspondingly, the processing requirements at the beginning or end of the speech unit are generally substantially less, in fact, less than the capability of the described apparatus. Thus, the use of buffer 34 to store acoustic data, in combination with the capabilities of the circuitries 28 and 30, allows the average processing rate of the apparatus to be greater than that required for real time operation. Thus real time operation is achieved without requiring the hardware to meet instantaneous peak processing rates.

Likelihood Computations
The purpose of the likelihood computation is to obtain a measure of the similarity between the input acoustic parameters from the principal component transformation and the template patterns which are employed for describing the elements of speech to be recognized. In the illustrated embodiment, each speech element is described by a sequence of template patterns. Associated with each template pattern is a minimum and a maximum duration. The combination of the template pattern and the minimum and maximum durations is an acoustic kernel. Typically the speech element

is a spoken word, although the speech element can also be a combination of words, a single phoneme, or any other speech unit.

In accordance with the present invention, the chosen measure of comparison is a monotonic function of the probability of a particular speech template given the observation of acoustic parameters at a given frame time. The acoustic parameters can be thought of as the observation of a random variable. The random variables which model the acoustical parameters have a probability distribution given by the pure unit of sound intended by the speaker. Since the sound which is received is not pure, and for ease of computation, the probability distribution which is chosen is the Laplace, or double exponential, distribution. The Laplace distribution is written as, for a single random variable,

$$p(x) = c\,e^{-b|x-u|} \qquad \text{(Eq. 33)}$$

where u is the mean of the distribution, b is inversely proportional to the standard deviation, and c is such that the density integrates to one. In order to make the computation easy, the logarithm of the likelihood rather than the likelihood itself is chosen as the measure to be employed. (This allows addition rather than multiplication to be employed in combining probabilities.) This can be accomplished since the logarithm is a monotonic function of its argument. Therefore the measure can be rewritten as:

$$\ln p(x) = \ln c - b|x - u| \qquad \text{(Eq. 34)}$$

In this computation only u and b need to be known since the natural log of c is determined by the condition



that the density must integrate to one. In order to keep most numbers positive, the opposite or negative of this measure is employed. The acoustic parameters for a given frame are assumed to be independent so that the final expression for the measure of likelihood probability becomes

$$\text{Cost} = \sum_i |x(i) - u(i)|\,b(i) - K \qquad \text{(Eq. 35)}$$

where the sum extends over all acoustic parameters, and K is a function of the b(i).

In the illustrated embodiment, the likelihood computation is generated for a template pattern "on demand" for use by the dynamic programming portion of the apparatus. Where, as here, there are two circuitries 28 and 30 operating in parallel, it is possible that the dynamic programming portion of circuitry 28 or 30 requires a likelihood score from the other circuitry 30 or 28. This would require the transmission of data over bus 24. The dynamic programming division of labor according to the grammar level and word level graphs described below is chosen to minimize this requirement for the transmission of data on bus 24.

The input to the likelihood calculation, as noted above, consists of the acoustic parameters at a frame time and the statistics (u and b above) describing the template pattern. The template pattern statistics consist of a mean (u(i)) and a weight (b(i)) for each acoustic parameter, and a logarithmic term corresponding to K. In the template pattern statistics creation, the logarithmic term usually has been shifted to the right (divided by a power of 2) by a quantity which is selected to keep the magnitude of the cost within a sixteen bit integer. For each acoustic parameter, the absolute value of the difference between the acoustic parameter and the mean is determined and that quantity is multiplied by the weight associated with the parameter. These quantities are added for all of the acoustic parameters; and the sum, if it is less than the maximum sixteen-bit quantity, is shifted to the right by the same quantity applied to the logarithmic term so that the logarithmic term can then be added thereto. The result is the likelihood or "cost" for the template pattern at that frame time.
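A minimal fixed-point sketch of this per-template cost computation, under the conventions just described (the names and the saturation step are assumptions):

```python
import numpy as np

def template_cost(x, u, b, log_term, shift):
    """Sixteen-bit Laplace cost of one template at one frame (Eq. 35).
    x: the frame's acoustic parameters; u, b: per-parameter mean and
    weight; log_term: the pre-shifted logarithmic term; shift: the power
    of two by which that term was shifted."""
    s = int(np.sum(np.abs(x.astype(np.int32) - u) * b))  # sum |x(i)-u(i)|*b(i)
    s = min(s, 0x7FFF)              # keep within a sixteen-bit integer
    return (s >> shift) + log_term  # align scales, then add the log term
```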
The Dynamic Programming Approach
A dynamic programming approach to speech recognition has been used and described elsewhere, for example, in the applicant's Canadian patent numbers 1,182,222, 1,182,223 and 1,182,224, all of which issued February 5, 1985. The dynamic programming approach used herein is an improvement upon the dynamic programming approach described in these Canadian patents.

Referring to Figure 10, the dynamic programming approach can be thought of as finding the best path 150 (that is, the path with the minimum score) through a matrix 152 in which the rows are indexed by time in discrete intervals (corresponding to the discrete time intervals for which measurements are made) and the columns represent elementary units of speech (acoustic kernels in the illustrated embodiment). In theory, it is possible to try all possible paths through the matrix and choose the best one. However, there are far too many paths to consider each time and hence, in order to find a computationally

efficient method and apparatus for finding the best path through the matrix, a Markov model for the speech is considered. A stochastic process is said to be Markovian if the probability of choosing any given state at a time t₁ depends only upon the state of the system at time t, and not upon the way in which that state at time t was reached.

In speech, there is coarticulation, the process by which a given elementary unit of speech affects both those units which are spoken before it and after it. (Units of speech have an effect upon the past because a speaker anticipates what he is going to say.) In order to work around the coarticulation problem within a word, the template patterns are formed for the coarticulated speech units. This method makes it very difficult to share templates between words which theoretically have the same speech units and is why, in the illustrated embodiment of the invention, the apparatus does not attempt to share such templates. For the purposes of the illustrated embodiment, coarticulation between words is ignored.

Thus, the Markovian model for speech is built by including within each state all the information relevant to future decisions. Then the units of speech are grouped into words because ultimately it is words which will be recognized and it is at the word level that syntactical constraints can and, in the illustrated embodiment, must be applied. Syntactical constraints are represented by a grammar graph 158 (Fig. 5) and it is the grammar graph that makes the model Markovian. Therefore, when recognizing utterances, the state space through which the path must be found is viewed as logically existing at two levels, the grammar or syntax level and the word level at which the elementary speech units exist.

At the grammar level the state space consists of a number of connected nodes. A node is a logical point in time which lies either between, before, or after individual words within an utterance. At each node there is a fixed legal vocabulary, each word or words of which connects the node to a new node. A grammar graph thus consists of a list of arcs, an arc having a starting node, an ending node, and a list of words which cause the transition therebetween (see Fig. 5). (For "self loop" arcs, the starting and ending nodes are the same.)
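Represented as data, such a grammar graph is simply a list of arcs. The fragment below is hypothetical: the vocabulary and node numbers are invented for illustration, and None marks a null arc of the kind introduced later in the text.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GrammarArc:
    start: int                   # starting node
    end: int                     # ending node
    words: Optional[List[str]]   # words causing the transition; None = null arc

grammar = [
    GrammarArc(0, 1, ["call"]),
    GrammarArc(1, 2, ["please"]),  # an optional word...
    GrammarArc(1, 2, None),        # ...skipped by a null arc (cf. arc 162)
    GrammarArc(2, 2, ["uh"]),      # self loop: starting and ending nodes equal
]
```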

The second level mentioned above employs word models. A word model is a finite state representation of a particular word as spoken by a particular speaker. In accordance with the illustrated embodiment, the word model employed is a linear sequence of acoustic kernels. As noted above, an acoustic kernel is a single acoustic template pattern having a minimum and maximum duration. In the illustrated embodiment, a word thus consists of a sequence of sounds (each represented by a template pattern), with a minimum and maximum duration of time being associated with each sound. There is no provision for alternate pronunciations and, in accordance with the preferred embodiment of the invention, the method is implemented for speaker dependent speech recognition. Thus, the method relies upon the best estimate that the same speaker says the same word in roughly the same way at all times.

In a graph form, referring to Fig. 6, each word model acoustic kernel has a minimum duration of n samples and is represented by n identical nodes 160. These are different from the grammar nodes mentioned above. The n nodes are strung together in a series, each node having a single arc coming into it and a single arc leaving it. The maximum duration, that is, a duration greater than the minimum duration, can be represented by a last node having an arc coming in, an arc leaving, and a self loop which is the optional dwell time, that is, the difference between the minimum and maximum durations. All of the arcs have the same acoustic template pattern associated therewith, and for the self loop, a count of the number of times through the loop must be kept in order to accurately maintain all of the information (which is needed later during trace back).

The word model graph and the grammar model graph are integrated by replacing, in the grammar graph, each arc with the corresponding word models. The connection between the grammar nodes and the word nodes is made by what are called null arcs. Null arcs also allow the apparatus to skip over optional words in the grammar, for example arc 162 (Fig. 5).

Once the needed likelihood computations are available, the method proceeds to recognize speech using those likelihood statistics and the allowable syntax graph of the incoming speech. Pictorially then, the graph of the utterance is first transformed into a lattice, with a metric, as follows (see, e.g., Fig. 10). Each state or node of the graph corresponds to a column of the lattice and each row of the lattice corresponds to a specific frame time. Thus, a lattice state in row I corresponds to time I and a lattice state in row J corresponds to a time J. Thus, traversing the lattice between row I and row J corresponds to obtaining the acoustic parameters for the times between and including times I+1 and J while traversing, in the "graph", the arc whose origin node corresponds to the origin column and whose ending node corresponds to the destination column. Imposing minimum and maximum durations upon the template patterns corresponds to restricting the vertical distance that the lattice arc can span between two columns.

The main thrust of the dynamic programming employed in the present invention is, at each row or time in the lattice, to find the optimal path to a destination lattice state using the previously computed costs of the states in the rows between the destination row and the starting row. The optimal or "best path" is equivalent to minimizing the cumulative likelihood cost by choosing the template pattern which maximizes the conditional probability that the speech unit corresponding to the template is the correct one given the acoustic parameters at the frame time. That conditional probability is maximized over all of the active templates (active templates are defined below).

More specifically, the dynamic programming performs the following steps at each frame time:

1) All nodes are initially set to an initial maximum (16 bit) likelihood score.
2) Destination nodes of null arcs can inherit their score from the source node of the arc in zero time.
3) Each active kernel in a word on each grammar arc is processed using likelihood computations and duration information, and a minimum score for the word at that frame time is determined.
4) If the minimum score for the word is greater than a predetermined threshold, the word is deactivated to reduce computations with successive frames. This is effectively a method for reducing computation based on the expectation that this path will not be the optimum one.
5) The cumulative likelihood score of paths at grammar nodes, that is, at the end of words leading to a grammar node, is computed.
6) If not all of the kernels of a word are active, and the score of the last active kernel is less than a preselected activation threshold, the next kernel of the word is made active.

7) If the score at the final grammar node of the graph (node 200 of Fig. 5) is better (i.e., less) than the score at any intermediate grammar node, then an end of utterance has been detected. (A simplified sketch of the word-level scoring appears below.)
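The fragment below is a self-contained, much simplified stand-in for steps 3 and 5: it scores one word arc by dynamic programming over a whole utterance at once, enforcing each kernel's minimum and maximum dwell. The real apparatus works incrementally frame by frame, with on-demand costs, activation/deactivation thresholds, and trace back, none of which are shown.

```python
import numpy as np

def word_score(costs, min_dur, max_dur):
    """Minimum cumulative likelihood cost of aligning T frames to a linear
    word model of K acoustic kernels. costs[t][k]: cost of kernel k's
    template at frame t; kernel k dwells min_dur[k]..max_dur[k] frames."""
    T, K = costs.shape
    INF = float("inf")
    # best[t][k]: min cost with the first t frames consumed by k kernels
    best = np.full((T + 1, K + 1), INF)
    best[0][0] = 0.0
    for k in range(K):
        for t in range(T + 1):
            if best[t][k] == INF:
                continue
            run = 0.0
            for d in range(1, max_dur[k] + 1):   # dwell d frames in kernel k
                if t + d > T:
                    break
                run += costs[t + d - 1][k]
                if d >= min_dur[k]:
                    best[t + d][k + 1] = min(best[t + d][k + 1],
                                             best[t][k] + run)
    return best[T][K]   # score delivered to the destination grammar node
```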

Considered in more detail at the acoustic kernel level, the dynamic programming uses the seed score for the source node of the kernel, the cost of the acoustic kernel calculated from the current frame, and the global minimum score of the last frame, to arrive at the likelihood score of a particular kernel at a present frame time. As noted above, the particular hardware embodiment described herein determines the likelihood costs "on demand". Thus as the likelihood costs are required by the kernel level dynamic programming for a particular frame time, the likelihood computation is generated. Each node corresponding to the kernel (recall, referring to Fig. 6, that the kernel is modeled by a plurality of nodes, one for each unit frame time duration) can inherit as the "seed score" a likelihood score from a preceding node. (For the first node of a kernel, the seed score is inherited from the last node of a preceding kernel, unless the first kernel node is the first node along a grammar arc, in which case the seed score is inherited from the last node of the best score leading into the grammar node.) In addition, the last node of the kernel can inherit a score from itself because of the use of the self loop in the word model, in which case the number of times that the self loop has been traversed must be recorded. In order to keep the accumulated scores as small as possible, all of the likelihood scores are normalized by subtracting the global minimum score (that is, the best score) of the last frame. The new score is then the sum of the inherited score plus the likelihood score or cost for that template at that frame time. When all of the "active" kernels have been processed, the minimum score for the word is determined and output to the corresponding grammar node.
If the minimum score for the word is greater
than a selected deactivation threshold, all kernels of the
word are made inactive except for the first one. This has
the effect of reducing the required likelihood and dynamic
programming computations at the risk of possibly
discarding what might become an optimum path. On the
other hand, if the score on the last node of the last
active kernel of a word (where not all kernels are active)
is less than an activation threshold, then the next kernel
of the word is made active. If all of the kernels of the
word are active, the destination node of the current
grammar arc receives a score which is the minimum score of
all the words on all the arcs leading into the grammar
destination node.
The dynamic programming employed here is similar to that described in Canadian patent number 1,182,224 noted above. The primary difference between the dynamic programming employed here and that described in the earlier patent is the use here of the null arc and the activation/deactivation thresholds. The use of the null arc in this "grammar" advantageously provides the ability to concatenate words, which enables an easier implementation of the apparatus according to the invention. And, as noted above, the activation/deactivation thresholds reduce the computational demands on the apparatus.

In the preferred embodiment of the invention, a partial traceback is employed as a memory saving device. At each frame time, all nodes at a time in the past, equal to the present time minus the maximum duration of a word, are checked to see whether there is an arc leading from that node to any node at a later time. If there is none, these nodes are eliminated from the set of nodes which can be used in traceback. Recursively, all nodes in the further past that have arcs leading only to the deleted nodes are in turn deleted. There results therefore a pruned set of traceback nodes which enables less memory to be employed for the traceback process.
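
A minimal sketch of this pruning rule follows; the node and arc representations ("nodes" as (time, id) pairs, "arcs" as a successor map) are assumptions for illustration, not structures taken from the patent.

    # Illustrative sketch of partial-traceback pruning: old nodes with no
    # arc to a surviving later node are deleted, and deletion propagates
    # recursively into the past.
    def prune_traceback(nodes, arcs, now, max_word_duration):
        """nodes: set of (time, id) traceback nodes.
        arcs: dict mapping a node to the set of later nodes it leads to."""
        horizon = now - max_word_duration
        # start with nodes at the horizon that lead to no surviving node
        dead = [n for n in nodes
                if n[0] <= horizon
                and not any(m in nodes for m in arcs.get(n, ()))]
        while dead:
            n = dead.pop()
            nodes.discard(n)
            # earlier nodes whose arcs now all point at deleted nodes die too
            for p in list(nodes):
                if p[0] < n[0] and arcs.get(p) and \
                        not any(m in nodes for m in arcs[p]):
                    dead.append(p)
        return nodes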

Once end-of-utterance has been detected, for example, when the final grammar node has a better (lower) score than any other grammar node of the apparatus, a forced traceback method is employed to determine, based upon the results of the dynamic programming over the length of the utterance, what was actually the best word sequence. The traceback starts at the final node of the grammar graph and progresses backwards, toward the beginning of the utterance, through the best path. The output of the traceback is the recognized utterance along with the start and end times and, if desired, a score for each word. Thus, the output of the least cost path search is the most probable utterance which is consistent with the specified grammar, the stored word models and templates, and the input acoustic parameters. Further information necessary to train on the utterance is also available from the system as described below.
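
The forced traceback amounts to walking backpointers from the final grammar node toward the start of the utterance. A minimal sketch, assuming a simple backpointer table (the patent's traceback memory stores equivalent information in its own layout):

    # Illustrative forced traceback over a best-path backpointer table.
    def traceback(backpointers, final_node):
        """backpointers: dict mapping a node to
        (previous_node, word, start_time, end_time, score) on the best path.
        Returns the recognized words with their times and scores."""
        result = []
        node = final_node
        while node in backpointers:           # walk back toward the start
            prev, word, t0, t1, score = backpointers[node]
            result.append((word, t0, t1, score))
            node = prev
        result.reverse()                      # report in spoken order
        return result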

Training and enrollment
The description presented thus far describes a method for recognizing speech using a plurality of previously formed template patterns. The formation of those template patterns is key to providing a speech recognition system which is both effective and reliable. Consequently, care and attention must be given to the creation of the template patterns. In particular, for the illustrated embodiment of the invention, the apparatus is designed to be speaker dependent and hence the template patterns are specifically biased toward the speaker whose speech will be recognized.

Two steps are described hereinafter for adapting the apparatus to a specific speaker. In a first enrollment step, a zero-based enrollment, an initial set of template patterns corresponding to a new word is generated solely from an input set of acoustic parameters. The set of template patterns is created by linearly segmenting the incoming acoustic parameters and deriving the template patterns therefrom. The second step, a training procedure, makes use of a set of acoustic parameters derived from the speaker, and the recognition results (that is, speech statistics) of known or assumed utterances which provide an initial rough cut for the template patterns, to create better templates. This is accomplished by performing a recognition on each word within a known utterance and using a known word model.


Turning first to the zero-based enrollment technique, a number of acoustic kernels, each with a minimum and maximum duration, are set for the word based on the duration of the word. The beginning and end of the word are determined, as will be described below, and the frames of acoustic parameters are then proportionally assigned to each kernel (five frames per kernel). The template pattern statistics, the mean and variance, can then be calculated from the acoustic parameters. In the illustrated embodiment, zero-based enrollment employs, for example, ten acoustic kernels for a word (an average of 50 frames in duration), each having a minimum duration of two frames and a maximum duration of twelve frames. There results a set of statistics which can be employed for recognizing that word or which can be improved, for example, as described below.
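
The zero-based enrollment computation can be illustrated as follows; the array shapes, the variance floor, and the function name are assumptions added here for a self-contained sketch:

    # Illustrative sketch of zero-based enrollment: frames are divided
    # linearly among the kernels, and a per-kernel mean and variance are
    # computed from the acoustic parameters.
    def zero_based_enroll(frames, n_kernels=10):
        """frames: list of acoustic parameter vectors for one word, with
        beginning and end already determined. Returns per-kernel
        (mean, variance) template statistics."""
        templates = []
        n = len(frames)
        for k in range(n_kernels):
            # proportional (linear) segmentation, e.g. about 5 frames/kernel
            lo = k * n // n_kernels
            hi = max(lo + 1, (k + 1) * n // n_kernels)
            seg = frames[lo:hi]
            dim = len(seg[0])
            mean = [sum(f[d] for f in seg) / len(seg) for d in range(dim)]
            # floor the variance so no component collapses to zero
            var = [max(sum((f[d] - mean[d]) ** 2 for f in seg) / len(seg),
                       1e-6)
                   for d in range(dim)]
            templates.append((mean, var))
        return templates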

For the training method, the input data include not only the acoustic parameters for the spoken input word but training data from a previous least cost path search. This data can include the tentative beginning and ending times for a word. If no thresholding is performed, and there are no dwell time constraints, the dynamic programming should give the same result as the grammar level dynamic programming. If the dynamic programming at the word level ended where expected, and traceback was performed within the word at the acoustic kernel level, the traceback information helps create better template patterns for the word. As a result, a better set of templates can be achieved. In the illustrated embodiment, a special grammar consisting of the word is employed. All of the kernels in the word are made active and word level dynamic programming is performed.

For each word used in the enrollment process, one of
the most important aspects is properly setting the starting time
and the ending time for the word. Several procedures have been
employed. One procedure employs a threshold value based upon
the amplitudes of the incoming audio signal. Thus for example,
a system can be designed to "listen" to one pass of the speech
and to take the average of the five minimum sample values (the
average minimum) and the average of the five maximum sample
values (the average maximum) during that pass. Then, a threshold
is set equal to the sum of four times the average minimum value
plus the average maximum value, the entire sum being divided by
five. The system then goes through the speech utterance again
and after several consecutive (for example seven) frame times
have amplitudes which exceed the threshold value, the word is
declared to have begun at the first of the frames which exceeded
the threshold value. Similarly, at the end of a word, the word
is declared to have ended at the end of the last of several
consecutive (for example seven) frames each of which exceeds the
threshold value.
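
This two-pass threshold procedure is explicit enough to sketch directly. The function below is illustrative (the name, the per-frame amplitude input, and the return convention are assumptions), but the threshold formula and the seven-consecutive-frame rule follow the description above:

    # Illustrative sketch of the amplitude-threshold endpoint detector.
    def find_word(amplitudes, consecutive=7):
        """amplitudes: per-frame amplitudes from one pass over the speech.
        Returns (start_frame, end_frame), or (None, None) if not found."""
        lows = sorted(amplitudes)[:5]            # the five minimum samples
        highs = sorted(amplitudes)[-5:]          # the five maximum samples
        avg_min = sum(lows) / 5.0
        avg_max = sum(highs) / 5.0
        # threshold = (4 * average minimum + average maximum) / 5
        threshold = (4.0 * avg_min + avg_max) / 5.0
        start = end = None
        run = 0
        for i, a in enumerate(amplitudes):
            if a > threshold:
                run += 1
                if run >= consecutive:
                    if start is None:
                        start = i - run + 1      # first frame over threshold
                    end = i                      # end of latest loud run
            else:
                run = 0
        return start, end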
A preferred approach, however, to determine the beginning and end of an utterance during enrollment is to use a "joker" word. This provides for excellent noise immunity as well as an excellent method of determining the beginning and end of a speech utterance. Referring to Figure 7, a grammar graph 168, for use with dynamic programming, employs the "joker" word and has a self loop 170 at a node 180, the self loop representing a short silence. The joker word, represented by an arc 182 between nodes 180 and 184, has a fixed or constant likelihood cost per frame associated therewith. An arc 186, representing a long silence, leads from node 184 to a node 188. When silence is the input signal, that is,
prior to the utterance being made, the self loop (short silence)
has a good likelihood cost per frame and the grammar "stays" at
node 180. When speech begins, the cost per frame for silence
becomes very poor and the fixed cost of the "joker", along an
arc 182, is good relative thereto and provides a path from node
180 to node 184. This initial joker frame is the beginning of
speech. Thereafter, the transition from node 184 to 188 denotes
the end of speech.
While the grammar graph of Figure 7 works adequately,
improved starting and ending times during enrollment can be
achieved using two "joker" words. Referring now to Figure 8,
a grammar graph 198 used by the dynamic programming begins at a
node 200, and so long as silence is received the grammar stays
in a self loop 202 representing a short silence (which has a low
cost), rather than proceeding along an arc 204 representing the
first joker word. The first joker word is assigned a relatively
high likelihood cost. When speech is encountered, the score for
the short silence becomes poor, exceeds the score for the joker
word (arc 204), and causes the grammar to traverse arc 204 to a
node 206. At node 206, a second joker word 208 having a slightly
lower likelihood cost than the first joker leads away from the
node. When long silence is recognized, the grammar traverses arc
210. This indicates the end of the word. This method gives
relatively good noise immunity (because of the added "hysteresis"
effect of the two joker words) and accurately determines the
beginning and end of the incoming isolated utterance employed
during the training. The two different likelihood costs per
frame assigned to the different


"joker" words have the effect of Meg sure a
word it really detected (by initially favoring
silence); and then, once a word is detected, the detected
word is favored by the lower cost Seiko jury) to make
S sure silence is really detected to end the word.

The parameters employed in the illustrated embodiment, in connection with the grammar of Figure 8, are:

                       First    Second   Short     Long
                       Joker    Joker    Silence   Silence
    Min. Dwell Time      12        1        1        35
    Max. Dwell Time     101       51        5
    Likelihood Cost    1800
    (Typical)
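
The two-joker endpoint logic can be illustrated by collapsing the dynamic programming into a simple per-frame state machine. This is a deliberate simplification, not the patent's method: the second joker cost of 1500 and the assignment of the 1800 cost to the first joker are assumptions, and the long-silence minimum dwell is taken from the table above.

    # Illustrative state-machine rendering of the Figure 8 grammar.
    FIRST_JOKER_COST = 1800.0    # typical cost from the table (assumed)
    SECOND_JOKER_COST = 1500.0   # assumed: "slightly lower" than the first
    LONG_SILENCE_FRAMES = 35     # minimum dwell of the long silence

    def detect_endpoints(silence_costs):
        """silence_costs: per-frame likelihood costs of the silence
        template (low during silence, high during speech). Yields
        (start, end) frame pairs for each isolated utterance."""
        state = "silence"        # node 200: short-silence self loop
        start = None
        quiet = 0
        for t, sil in enumerate(silence_costs):
            if state == "silence":
                if FIRST_JOKER_COST < sil:   # joker beats silence: speech
                    state = "word"           # traverse arc 204 to node 206
                    start = t
            else:
                if sil < SECOND_JOKER_COST:  # silence beats second joker
                    quiet += 1
                    if quiet >= LONG_SILENCE_FRAMES:
                        yield (start, t - quiet)  # arc 210: word has ended
                        state, start, quiet = "silence", None, 0
                else:
                    quiet = 0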
Referring to Figure 8A, the joker word can also be employed to provide immunity to sounds such as coughs during normal speech recognition. In this respect, it is a "prelude" to normal recognition. In accordance with that aspect of the use of the joker word, a grammar graph 220 has a starting node 222; and so long as silence is received the grammar stays in a self loop 224 representing short silence (which has a low cost) rather than proceeding either along an arc 226 representing a first joker word or an arc 228 leading to the start of the spoken grammar. When speech is encountered, the likelihood cost for the first joker word is relatively high and a transition occurs along the arc 228 into the spoken grammar graph.

If, however, "non-speech", such as a cough, occurs, it is the value of the first joker word which provides a likelihood cost that causes the grammar to traverse arc 226 to a node 230. The grammar stays at node 230, held there by a second joker word on a self arc 232, until a long silence is recognized. The long silence allows the grammar to traverse an arc 234 back to starting node 222. In this manner, the machine returns to its quiescent state awaiting a speech input and effectively "ignores" noise inputs as noted above.

System structure
Referring now to Figure 4, according to the preferred embodiment of the invention, the hardware configuration of Figure 1 employs three identical circuit boards; that is, the signal processing circuit board corresponding to circuit 26, a template matching and dynamic programming circuit board corresponding to circuit 28, and a second template matching and dynamic programming board corresponding to circuit 30. Each board has the same configuration, the configuration being illustrated as circuitry 218 in Figure 4. The circuitry 218 has three buses, an instruction bus 220, a first internal data bus 222, and a second internal data bus 224. Connected between data buses 222 and 224 are an arithmetic logic unit (ALU) 226, a fast 8 bit-by-8 bit multiplier circuit 228 having associated therewith an accumulator 230 and latching circuits 232 and 234, a dynamic random access memory (RAM) 236, for example having 128,000 words of sixteen bit memory, with latching circuits 238, 240, and 242, and a fast static RAM 244, for example having 2,000 words of sixteen bit memory, with associated latches 246, 248, and 250. A writable control store 252 has control over the operation of the buses and arithmetic elements. The control store 252 is random access memory having
an output to bus 222 and providing instruction data on bus 220. The writable control store, which may be, for example, a 64 bit wide RAM, stores the program instructions and is controlled and addressed by a microsequencer 254, for example an AMD 2910 microsequencer. A clock machine 256 is provided for clock timing, dynamic RAM refresh, and other timing functions as is well known in the art.

This structure employs a double pipeline
method which enables the use of relatively inexpensive
static RAM for the control store 252.

Important to fast operation for implementing the Laplace transformation to perform likelihood cost generation, an inverting register 260 is provided for converting the output of the arithmetic logic unit 226 to a twos complement output when necessary. The output of the inverting register is applied to the bus 224.
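
If the transformation referred to is the Laplace (double-exponential) likelihood, the per-frame cost reduces to a sum of scaled absolute differences, which is exactly what a fast two's complement (absolute value) path accelerates in hardware. A software sketch under that assumption (the function name and the use of a mean absolute deviation vector are illustrative):

    # Illustrative per-frame likelihood cost under a Laplace model:
    # -log density = |x - mu| / b + log(2b), summed over parameters.
    import math

    def laplace_cost(params, mean, deviation):
        """Negative log likelihood of one frame's acoustic parameters
        under a per-kernel Laplace template with the given mean and
        mean-absolute-deviation vectors."""
        cost = 0.0
        for x, mu, b in zip(params, mean, deviation):
            cost += abs(x - mu) / b + math.log(2.0 * b)
        return cost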

The operation and programming of the boards 26, 28, and 30 is controlled by the particular program code, which programming enables the board, when employed as signal processor 26, to provide the acoustic parameters necessary to perform the template matching and dynamic programming. Similarly, the programming of circuitry 28 and 30 enables the acoustic parameters to be properly processed for generating likelihood costs and for implementing the dynamic programming.

In operation, as noted above, the template matching and dynamic programming circuits 28 and 30 perform the likelihood cost calculations on demand, as directed by the dynamic programming method. This can be accomplished for two reasons. First, the template pattern data necessary to perform all of the likelihood calculations needed by the dynamic programming portion of a board will be found on that board. (This is a result of the highly structured grammar graph and the restriction that templates are not shared between words.) Second, each board, for example board 28, receives the necessary likelihood scores it lacks to complete the dynamic programming for the board. The processor control 32 orchestrates this transfer of information.

The grammar graph referred to above and described in connection with Figure 5 is stored in memories 236 and 244. Because it is stored in a highly structured manner, the data representing one grammar graph can be replaced with the data representing a second grammar graph, making the equipment flexible for recognizing different syntax combinations of words or entirely new speech vocabularies (in which case training on the new vocabulary words would have to be done). In the illustrated embodiment, the data replacement is preferably performed by storing multiple grammars in memories 236 and 244 and selecting one of the grammars under program control. In the illustrated embodiment, the processor control 32 can include a disk memory for storing additional grammars.

In addition, as noted above, the microprocessor buffer 34 provides the capability of performing variable rate processing. Thus, the dynamic programming and likelihood score generation can fall behind real time in the middle of a speech utterance, where the greater computational requirement
occurs, catching up toward the end of the utterance
where fewer calculations need to be made. In this
manner, the entire system need not respond, as noted above,
above to the peak calculation requirements for real
time speech recognition, but need only respond to the
average requirements in order to effect real time
recognition in the speaker dependent environment
illustrated herein.
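
The variable rate idea can be sketched as a producer/consumer buffer where frames arrive at a fixed real-time rate and are consumed as fast as the (varying) per-frame computation allows; everything here (names, units of "work") is illustrative, not the patent's implementation:

    # Illustrative sketch of variable-rate processing behind a frame buffer:
    # only the average computation rate must keep up with real time.
    from collections import deque

    def run_variable_rate(frames, work_per_frame, budget_per_tick):
        """frames: incoming acoustic frames, one per real-time tick.
        work_per_frame(frame) -> computation units needed for that frame.
        budget_per_tick: computation units available per tick.
        Returns the maximum backlog reached (how far processing fell
        behind real time)."""
        queue = deque()
        credit = 0.0
        max_backlog = 0
        for frame in frames:
            queue.append(frame)          # producer: one frame per tick
            credit += budget_per_tick
            while queue and credit >= work_per_frame(queue[0]):
                credit -= work_per_frame(queue.popleft())
            max_backlog = max(max_backlog, len(queue))
        return max_backlog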

Other embodiments of the invention, including additions, subtractions, deletions, and other modifications of the preferred described embodiment, will be apparent to those skilled in the art and are within the scope of the following claims.
What is claimed is:


Administrative Status

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer, as well as the definitions for Patent, Administrative Status, Maintenance Fee and Payment History, should be consulted.


Title Date
Forecasted Issue Date 1987-12-01
(22) Filed 1985-01-17
(45) Issued 1987-12-01
Expired 2005-01-17

Abandonment History

There is no abandonment history.

Payment History

Fee Type Anniversary Year Due Date Amount Paid Paid Date
Application Fee $0.00 1985-01-17
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
EXXON RESEARCH AND ENGINEERING COMPANY
Past Owners on Record
None
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents



Document Description    Date (yyyy-mm-dd)    Number of pages    Size of Image (KB)
Drawings                1993-07-28           4                  163
Claims                  1993-07-28           4                  158
Abstract                1993-07-28           1                  29
Cover Page              1993-07-28           1                  18
Description             1993-07-28           45                 2,028