Note : Les descriptions sont présentées dans la langue officielle dans laquelle elles ont été soumises.
Y0983-0~9
L2365i7~7
APPARATUS AND ~ETHOD FOR PERFOR`1ING ACO~STIC `IATCHI~G
BAC~GRO~ND OF rHE INVE~TTION
I. FIELD OF THE INVENTION
The present invention relates primarily to the art of
speech recognition, specifically the art of perfo-ming
a statistical match between vocabulary words and incom-
ing labels produced by an acoustic processor in response
to a speech waveform input.
II. DESCRIP~IO~l OF THE PRIOR ART
Typically, the purpose of a speech recognition sys~em
or machine i5 to automatically transform natural speech
into some other form, for example written form. In
achieving this aim, various general approaches have been
considered. One approach is directed ~o simulating human
speech interpretation processes. Another approach is to
view speech in a statistical context.
In the statistical approach itself, several techniques
have been considered as suggested bv the Bahl, Jelinek,
and Mercer article, "A Maximum Likelihood Approach to
Con~inuous Speech Recognition," IEEE Transactions on
Y0983-029
-2- ~3~S77
Pattern Analvsis and Machine Intelligence, Volume
PAMI-5, Number 2, pp. 179-190 (1983). In the 8ahl et
S al. article, it is noted thaL the typical model of a
speech recognition sys~em includes a text generator
followed by a speaker, or talker. The text generator
determines what is to be said and the speaker produces
a natural speech waveform. The natural speech waveform
enters an acoustic processor the output from which en-
ters a linguistic decoder. Depending on the technique
employed, the above-noted elements may be associated in
various ways. Bahl et al. combine the speaker and
acoustic processor to function as an acoustic channel
wherein the speaker provides text as a speech waveform
and wherein the acoustic processor acts as a data
compressor which provides a string of labels (also re-
ferred to as symbols or fenemes) to the linguistic de-
coder. The labels may be generated in any of a number
of ways and are commonly identified collectively as the
string Y made up of sequential labels YlY2Y3---- The
purpose of the linguistic decoder is to represent the
original spoken text, in some prescribed form, based on
the incoming string of labels.
In the above~noted article, one acoustic -~
~rocessor the IBM* centisecond acoustic,
processcr (CSAP) -- is describe~
* Trade l~ark
Y0983-079
3 ~2~
as transforming the speech waveform into a slring of
parame~er vectors. Each parameter vector is compared to
stored prototypes (or standard vectors) -- the distance
from the parameter vector and each prototype being de-
termined. The "label" for the proto~;ype which is clos-
est is then assigned to the waveform parameter vector.
The label can have any of various forms and may be de-
termined in any of various 1rnown manners in accordance
with existing technology.
The purpose of the linguistic decoder is to perform a
matching process between the incoming labels and words
provided in the system vocabulary. In the probabilistic
approach set forth in the Bahl et al. article, the lin-
guistic decoder aims at de~ermining a word strinO l; that
has the highest probability of having produced the
string of labels Y1Y2Y3--- Mathematically, this is re-
presented by the expression:
2~ MaxPr(h¦Y~, (1)
the maximum probability of h given Y over all word
strings ~'. According to well-~nown prob~bility theory,
this can be ~-ritten as:
Pr~WIY)= Pr(W) x Pr(YIW)tPr~y) (2)
Y0983-029
-4- ~6~77
where Pr(Y) is independent of W and the probability of
a given word string W, namely Pr(W), is determined by a
language model in the linguistic decoder.
Suppose that at some point in the decoding process some
initial substring. for example YlY2 YT~ has been ten-
tatively decoded as the ~ord string WlW2...Wn. The
present invention is directed to determining a set of
candidate words W( +i) for which
( ~n+l)lYl YTYT+l YT'k' 1 n)
is relatively large --compared to other words in ~he
vocabulary - for some value of k.
In determining Pr(YIW), Markov modelling has been con-
sidered. The number of computations required by several
linguistic decoding techniques is noted in the Bahl et
al. article as being fairly high, especially with
larger vocabularies on the order of ~000 words and more
for example.
A key question in linguistic decoding has therefore been
how to determine Pr(Y¦W) for word strings from a vocab-
ulary without requiring inordinate computation time and
without sacrificing accuracy in decoding.
Y0983-029
~365~
This object and others are addressed by the present in-
vention.
SU~I!~RY OF THE I~VE~TIO~
In accordance with the present invention, a linguistic
decoder is provided which facilitates the determination
of which word or words in a vocabulary have the highest
probability of producing a par~icular string of labeis.
To achieve this object, the presen~ invention features
apparatus ,?nd method for performing a statistical match
of words to labels with a number of appro~imations that
expedite the matching determination without introduclng
undesired loss in accuracy. In addition, the presen~
invention relates to apparatus and method wherein ~.ords
having similar phonetic beginnings are matched against
the incoming labels simultaneously for so lonO as the
words have similar phonetic beginnings.
In noting the approximations embodied in the invention,
it should be recogni~ed that the invention models each
vocabulary word as a sequence of phones. ~ach phone is
represented by a phone machine. Each phone machine is
precisely characterized as having (a) a plurality of
states, (b) transitions from state to state and a prob-
2S ability associated with each transition, and (c) an ac-
Y0983-0~9
~ ~3~
tual output probabilit-y that a given label is produced
by a given phone machine at a given transition -- each
phone machine (corresponding to a given phone) defining
the probability that a given label is generated at a
given transition thsreof. It is possible to determine
match scores for words based on these characteris~ics,
however the number of computations is high. In accord-
ance with the present invention, prefer2bly each ohorle
machine is simplified by replacing the ac;ual l_bel
probability for each label at all transitions in a g ven
phone mach ne with a speciflc replacement value. The
specific replacement value is preferably selected so
1~ that the match value for a given phone with the rn-
placement value included is an overestimation o' tlls
match value achieved by the detailed match ~here ~he
replacement values do not replace the actual label
probabilities. One way of assuring this condition is by
selecting each replacement ~-alue so that no probabili~y
corresponding to a given label in a given phone machine
is greater than the replacement value thereof. 3y sub-
stituting the actual label probabilities in a phone ma-
chine with corresponding replacement values, the number
2_ of required computations in determining a match sco.e
for a word is reduced greatly. ~oreover, since the re-
placement value is preferably an overestimation, the
resulting rnatch score is not less than would have pre-
Y0~83-029
~3~S~7
viously been determined without the replacement. Ac-
cordingly, the object of recucing computation wlthout
eliminating likely candidate words is achieved.
Another approximation relates to an additional facLor
that enters into the determination of 2 match score be-
tween a word and a string of labels, n~mely a label
length dis~ribution associated with each particular
phone machine. That is, for each phone machine~ there
is a probability distribution that each nUmDer of labels
contained between a minimum L i and a maximum number
of labels Lmax is produced by the particular phone ma
chine. To further facili-ate and reduce computation in
1; accordance ~ith the invention, the l~bel leng~'l proba-
bility distribution (between the minimum length and
maximum length of iabels) is considered uniform so tha~
the probability of each length of labels (between L .
mln
and L ) is the same.
max
As an additional refinement! the present invention in-
cludes a limitation on the number of labels examined b~-
the phone machines in order to determine a match v~lue
between a corresponding word and a string of incoming
labels. This added feature achieves the objects of re-
2; ducing decoding delay and reducing inequalities arising
Y09S3-0~9
-8- ~2~S77
from compa-ing match scores for words of different
leng~hs.
o Furthermore, it is an object of the invention to derive
a list of candidate words b;- means of a basic f25~ ma.ch
(which approximates actu~l label probabillties for each
label in a given phone with a respective replaccment
value) or an alternative fast m~.ch lwhich also approx-
imates label length probabilities ~;ith a specifiec value
for a given phone) wherein the candida~e words are
processed by successive detailed match phone machines
and/or a language model in order to arrive at a single
word and, as appropriate, some possible alterna~i~es--
1~ especially where the fast match machines limit the num-
ber of ~ncoming labels examined to de~ermine a ma~ch
value for a phone.
In achieving the object of processing the beginnings of
words simultaneously, the present in~tention defines
words or portions thereof as phonetic baseforms arranged
in a tree structure. Each baseform, it is noted, is re-
presented by a sequence of phones each of wnich has i~s
own phone machine corresponding thereto. For each
baseformf a sequence of phone machines extends from ~he
2~ root of the tree. For so long as two or more baseforms
have similar phonetic begirmings starting from the root,
~09~ 0~9
g ~ 77
a commcn branch of phone machines is provided therefor
Two or more baseforms can hereby be selected or elimi-
na~ed as candidate words by simultaneously processing
~hem through Ihe same phone machines at the same time
for so long as the baseforms have sir"ilar beginnin~s
The object of reducing computation without loss in ac-
curacy is thereby rur her achie~ed
In a preferred form, the invention r_lates to perforl~ing
an acoustic match in a 1inguistic decoder wherein each
phone machine therein is characteri~ed as naving (a) a
plurality of stales and .ransi~ion pqths between sta-es,
(b) transi.ions tr(Sj/Si) havi!g probabili~ies Tti`j)
1~ each of which represents the DroD~qbili-'; OL a ~ransition
to a state Sj given a current state Si where Si and Sj
may be the same state or diffsrent states, and (c) actual
label probabilities wherein each actual label probabil-
ity p(ykli';) indicates .he probabili~y that a label y~
is produced by a given phone machine at a given transi-
tion from one state to a subsequent state where k is a
label identifying notalion; each phone machine including
(a) means for assigning to each Yk in said each phone
machine a single specific value P~(yk) and (b~ means for
2- replacing each actual output probability p(ykli~j) at
each transition in a given phone machine by the single
specific value p'(y~) assigned to the corresponding Yk
Y0983-0~9
.A 1 0
~2;3~5~7
Preferably, ~he replacement value is at least as great
as the maximum actual label probabilitv for the corre-
; sponding Yk label at any transition in a particular
phone machine.
The foregoing and o~her objec~s, fea~ures and advantages
of the invention will be apparenr rom rhe more par~ic-
ular desc-iption of the preferred embodiments o ~he
invention, as illustrated in the accompanying dra~ing.
BRIEF DESCRIPTIO~' OF THE DRAI;I~GS
FIG. 1 is an illus~ration of a ~ree s~ruc~ure of
baseforms according tO the inven~ion.
FIG. 2 is an illustration of an example detailec match
phone machine in accordanse ~ith the invention.
FIG. 3 is a trellis diagram showing the state t ar.si-
tions over time for the detailed ma~cn ?hone machine.
FIG. 4 is a general diagram cf a phone machine accord ng
to the invention.
FIG. ~ is a diagram showing a fast match computation
under predefined conditions.
Y0~83-0~9
-11-
~L~36~77
FIG. 6 is an illustration showing ~he derivation of a
starr-time distribution for a phone from the end-~ime
S dist-ribution generated by the phone machine correspond-
inO to the ~IOSt recent previous phone.
FIG. 7 is an illustration of an alternative fast match
phone machine accorair.~ to the in~-en.ion.
FIG. 8 is an illustration o' an ~lternativA fast macch
phone machine wherein the number of labels e~amined in
the matching procedure is limiced in accordance with the
invenlion.
: .
In the drawings, li~.e elements are des.cna~ed with sim-
ilar reference numbe~s~ and identical elements in dif-
1~ ferent specific embodimencs are designated by iden~ical
reference numbers.
Y~5~3-029 ~236S77
-12-
DESCP~IPTIC.~ OF PRE~Er~P~ED E~I~ODIMENTS OF THE I~-~E~lIO~
I. TREE STR~CT~P~E OF BAS~FOR`IS
; In Fig. 1, a plurality or -~70rcs are arranged in a tree
structure 100. Tlle tree has a root 102 from ~-hich a
pluralit~ of baseforms extend. Each baseform is a pho-
netic baseform and extends to a leaf of the tree~ eacll
leaf of the tree representing a ..orc in the vocabulary.
Pre~erably, the voc2buiary is on the order of SOOO-horis
or more, although smaller vocabularies may also be em-
ployed.
Each baseform is shohn including a string of pllones,
each baseform s~ar~ing at tne roo~ 102 of ~he tree
1~ structure 100 and ending at a respective leaf. A number
of baseforms, it is noted, have similar phonetic begin-
nings. For so long as a plurality of Dase,~orms na~e
similar phonetic beginnings, such baserorms share common
initial branches along the tree structure 100. Starting
at the tree root 10~, the baseforms 10~ through 118 for
manager, managers, memo, memory, memoranda, and memo-
randum, respectivel~y, all extend along a common branch
identified as the phonetic element MX. Similarly, the
baseforms 112 through 11~ further extend together along
~ the branches identified as the phoneti_ elements ~X,
EHO, MX. Where the baseforms de?art phonetically, the
~0~83-Q~9 -13~ 6S77
tree structure lG0 branches in different direclions. Tn
accordance with the inventior, the malching of ?hone~ic
S elemenls ~l~ and then EH0 is performed once for all the
baserorms 112 through 118. That is, baserorms are not
e.xamined se?aratek- one at a time. Instead, processino
takes place along ~ranches in the tree structure 100 so
tnat words na~ing similar phonetic oeginnin~s are ?roc-
essed simultaneousl5- for so lona as ~heir res?ecti~-e
phones lie along a common path of branches. ~y ?rsc-
essing along the common path, a number of words ma~- De
considered simultaneouslS- for inclusion in a list of
candidate words lrom which the actual recognized wora
lo may be selected. Li~ewise, a number or WO.da can be
eliminated simultaneously.
In Fig. ll the phonetic elemenls are phones each of which
corresponds to a conven-tionally definec sound. ~-hen
strung together, the phones form word-representinv
baseforms. For example, a first pronuncialion of the
word "the" is represented by ~he phone DH followed by
the phone ~-Hl. Approximately /0 ~o lOG phones, when
appropriatel~- selected and s~rung -~ogether, can re?re-
sent the words in the English 'anguage.
~; A word --when pronounced in ~ore than way-- is repres-
ented in Fig. 1 by distinct strings of conventional
Y0983-0~9
-14- ~36~7~
phones, as su~gested b-~ the disLinct baseforms for the
two pronunciations of the ~ord "~he", namel~ lHLi and
THE2. Alternativel~, the same ~ord mav be represer.ted
by a basefor~ of "clinks" wherein each ciin~ can e?re-
sent various ?ronunciations of a par-~icular por~ion of
a word. The t~-o pronunciations a~ter the DH for the word
"the" could be represen~ed by 2 sir.Ole clink which would
account for ~he ~Hl and EEl phones. Similar1~T-, a se-
quence of phones ma~- be in par311el wilh one or more
other phones as alternative pronunciations for the same
segment of a ~-ord or string of words --the parallel paths
together representing a clink. Darallel p~ths which
l~ cross word boundaries can also be re~re.,en~ed by a
c.link. Whethsr defininO basefor~s in cerr.ls or conven-
tional phones or clinks, similar principles .~re in-
volved. Hence, in this patent application, references
hereafter to "phones" are considered generic to inclu~e
conventional phones, clinks and other sim_la pllonetic
elements which represent elemental inputs.
II. PHONE ~1ACHINES
A. I~TROD~CTION
In the present invention, a statis,ical àetermination
is made as to ~hich words in the vocabulary have the
~'~983-079
highest proba'oility OL^ hz~ing generaled a string of in-
coming labels generated by an acoustic channel. .4s
noted herelnbefore, each h~ord is represented by a se-
quence of phones. Each phone is, n turn, character zed
by a respecti~-e ohone machine. Each phone machine storcs
data which provides an indication of the likelihood that
the phone corres?oncing there~o generates any label
string. ~hen a particular label s,.ing s entcred in~o
a gi~en pnone machine, the gioen phonc macnine derer-
mines --from its stored aata-- the li~;elihood ~hat he
given phone machine oroduces the p.~rticular incom~n~
label string.
l; ~s described below~ phone machinns o~ SeVera1 L~.'peS .1. C
em'bodied in accordance ~-ith the inven~ion. Fi.st~ .he.o
i5 a detailed match phone machine ~hich incorporates
predefined probabilities into a phone rllodcl. The de-
tailed match phone machine is charac~erized D~' ~a) a
plurality of states and l_ansitions between the s~ates:
(b) a probability associated hith each transition, that
the particular transition occurs; and (c) a probabil'ty
that, at a given transition, a particular label is gen-
erated by the phone. ~;hile the detailed match phcne
7; machine can be used in determining ~:ith grea~ exactness
how closely a phone matches a string of incoming labels,
tremendous calculation demands attend the detailed match
YQYS3-0''9
-16- ~65~
phone machine. To overcome ~he calcula.ion demands as-
sociated with the cetailed match phone ma_nine, a seccnd
phone machine is considered by rhe invention. The secvnd
phone machine, hereaf.er refer-ed to as a. basic ~~ast
match phone machine, mai;cs an appro.~imation ~ha~ sim-
plifies calculations relative LO the detailed match
phone machine. Specifically~ wherever a given lahcl has
cL probability of occurring at any .r_n~i~ioD in a gi~-?.n
phone, tha~ probability is assigned â single specific
value --which preferably is at least as large as the
highest probabilit~ of the label occurring at an,~ tr~rl-
sition in the phone. In a further embodiment~ h~reaf-er
1~ referred to as the alternatLve f~st ma~ch, the proDa-
bility of a phone genera~ing ~ string o' labels of an~
of a number of lengths is considered unirorm 'oy the pnone
machine corresponding thereto. In the alternative fast
match phone machine, a minimum Length and a ma~i~um
~0 length are specified bet~-een ~-hich tn_re is defined a
length distribution. In the al~ernative fast malch Dhone
machine, the probability OL any length ~-ithin the leng~h
dis~ribution is repiaced by a single defined ~alue and
tr,e probabilit~; of any leng~h ou~side ~he disrribu~ion
is zero. The alternative fast malch phone machine can
be applied to the basic fast match phone machine to
further reduce the amount of calculation required in
'tO~S3-0~9
- 1, - ~.;~36S77
determinin~ whe~her a word defined b~- a string of ~n~ne
machines qualifies as 2 candidate t;ord.
In accordance t~ith the in~ention, a list OL- words is
e.Ytrac~ed from the vocabulary of~;ords by processing the
hords through either basic fast match phone machines or
through alternative fast match phone machines that
pre~er~bl~- includes both label probabiiity reD~aceme!lt
values and length distribution replâct-menl vâlues. .he
words in the ex~rac~ed list of words are thereafter
processed by detailed match ?hone machines to arrive at
a single word. In this regard, i, should be reali~ed c"at
a langua,ge model (not described hereinj is also pr~,er-
1~ ably provided to eliminate inzppro?riate words ~an-
guage models, such as models emplG~-ing tri-grams, h~ve
been discussed in the art. Briefk , a language model
emploved with the in~ention includes probabililies ror
respective three t~ord sequences occurring, such proba-
bilities being based on data deri--ed rrom a text. Spe-
ciricall~, the language moeel is im?lemented in PL/I on
an IB~I 4341 using an IB~i 33S0 dis~ drive ror the lang~ave
model s~orage. The IB~I 4341 is also connected to the L~S~
mâtch processor, the detailed match processor. and the
~5 rront end processor. In addition, the IB~' 4341 is also
connected to â work station implemented in, for example,
Pascal on an Apollo Domain computer to enable a user to
~0983 ~07~
36~77
-18-
enter input to the s stem. The wor~ station is connected
to the host IBM ~341 through an IB~ person31 computer
and a ~,Q'~ commur.icztions conlroller.
E~ch of the above-notec phone machines receives as ~n-
puts a start-lime dis~ribution and a string of labels
and determines therefrom a match value t~hich SUggeatS
the ll~elihood that a given phone produces a sequence
of labels in the string.
Each of ~he above-mentionea typGs of phone machines is
now described in greater detail.
B. DETAILD ~TCH PHONE ~IACHI~E
In Fig. 2, a sample detailed match phone machine ~00 is
1~ depicted. Each detailed match phone machine is a ?rob-
abilistic finite state machine cnarac~erized bv ~aj a
plurality of states Si, (b) a plurality of transitior.s
tr(SjlSi), some of the transitions extending between
differen~ states and some e~tending from a stale back
7 to itself, aach transieion having associated therewith
a corresoonding probabiliry, and (c) for each label that
can be generated at a Darticuiar transition~ a corre-
sponding actual label probability.
~0~o3-0~9
i77
-19-
In Fig. ~, seven states 51 th..ougn 5 are provided and
thirteen transitisns trl ~hrou~h trl~ are pro~lded in
; Ihe detailed match phone macnine ~OO. A review o' Fig.
'~ shows that phone machine '~OO has three transitions
with dashed line paths, namely transitions trll, trl~,
and trl3. At each of these three transitions, the Phone
can change Lrom one state to ano~her with3ut prod~cing
a label and such a transi,ion is, accordirgl~-~ referred
to as a null transition. Along transitior.s Lrl through
trlO labels can be produced. Speciri^all!,-, along each
transition trl through trlO, one or more labels may ha~-e
a distinct probability of being gener~.ed the.eat.
1~ Preferably~ fo- each transl.tion ~here is a ?robabil e;
associated ~ith e~ch label ,h-t c~n be ge~er2ted in the
system. Tha~ is, if there are ~wo hundred 'abels ~ha,
can be selectively generated by the acousric channel,
each transition ~that is not a null) has ~wo hundred
"O "actual label probabiliLies" associated rherewith --each
of which corresponds to the p-obability ~hat a corre-
spondinv label is generated b~- the phone a, the partic-
ular transition. The actual label probabilities for
transition trl are represented b~ tne symbol p followed
rr, b~- the bracketed column of numerais 1 through ~OO~ each
numeral representin~ a given label. rOr label 1~ ~here
is a probability pll¦ that the detailed pllone machine
"OO generates the label 1 at transition trl. The ~rarious
VQ98~-0~9
-20-
actual label pro;~abilities are stored with relacion to
the labei and a corres?onding Iransiticn.
S When ~ st.ing of labels YlY~Y~--- is presented to a de-
tailed match phone machine ~uO corresponding to a Oiven
phone, a match procedure is performed. The procedure
associated wilh the detailed malch phone machir.e is e.~;-
plained ~.~ith rere_ence to Fi~. 3.
.
i.C Figure 3 is trellis diagram of the phone machine oi Fig.
2. As in the phone machine, the Ireliis diagram shohs a
null transition from slate Sl to state S_ and ~ransi-
tions from state Sl to sta e S~ and ' om sla~e Sl ~o
state S4 The transitions be~i.;ecn othor s~ales are ~lso
1~ illustrzted. The trellis diaOram also sho~;s time meas-
ured in the horizon~al direction. Start-time probabili-
ties qO and ql represenl the probabilities that a pnone
has a start time at time t=to or t=tl, .espectivaly, for
the phone. At each start time tQ and 'l~ the various
transitions are shown. It should be noted, in this re-
gard, that the inrerval belweer. successive atart ~and
end) times is preferably equal in length ~o ~he time
inter~al of a label.
In employinO the detailed match phone machine ~00 to
determine how closely a given phone matches the labels
:.,.,,., .: :
Y09~ 9
57~7
-Zl-
of an incoming string, an end-time distribution for the
phone is sought and used in determining a match value
for the phone. rne notion of rel~-ing on the end-ti~.e
distribu-tion is common to all embodiments of phone ma-
chines according to the inven~ion. Tn genera~ing the
end-time distribution to perform a delailed match, the
delaiied match phone machine 200 involves computations
whicr. are exact and complicated.
Loo~in~ at the trellis diagram of ~ig. 3, we first con-
sider the computations required ~o have both a start
time and end time at time t=to. For this to be the case
accorting to the example phone machine structure sec
lS forth in Fig. 2, the following probaDilit-; aDpi~es:
Pr(S7~t=tO) = qO.YT(1-'7) ~ pr(s~r=to)~:r(~
Pr(S3,t=tO)xT(3~7) (3)
where Pr represents "probabilit~ of" and T represents
the transition probabilit~- bet-~een the two parenthet-
ically identified states. The above equation indica~es
the respective probabilities '~or the three conditions
under which the end time can occur at time t=to. 'lore-
over, it is obse.ved tha~ the end time a. ~=to is limited
in the current exampie to occurrence at state S7.
2S Loo~:ing next at the end time t=tl, it is noted that a
calculation relating to ever~ state other than state S
~98~-0~ 3~577
must be c,ade. The state Sl starts a~ -ne end time of the
previous phone. For ?urposes of e~planation, only the
calculations pertaining to state S, a.e set for~h.
~or state S4, the calculation is :
Pr(S4,t=tl)= Pr(Sl,t=tO)xT(1~~4)xPr(yl1~4)
Pr(s4~t=tO)~T(4~ Dr(vl4-~4) (4)
In words, the equation ~4) set .orth mme&iately ab3~-e
indicates that the ?robabilit; of the p!~one machine be-
ing in state S4 at -ime t=tl is de?endent on the sum of
the follor~ing two terms (a! the probabilitv of oein~ at
state Sl at time t=t~ multiplied Dy t~e probâL)ilit-. !-.;
of the transition from state Sl ro s~ate S c.ul~i~iied
l; further by the probability (Pr) of a Oiven label ---.--
in the string being generated gi~ren a transition rrom
state Sl to state S4 and (b) the probability of beir.g
at state S4 at time t=to ~ultiplied by the probabili~y
of the transition from state S, to itself and further
multiplied by the probability of generating the given
label --y-- during and gi~en the transition from state
S to itselE.
Simil2rly~ calculations pertaining to the other states
(excluding sta'e Sl) are also performed to generate
corresponding probabilities that the phone i5 at a par-
Y0983-029 ~236S ~7
ticular slate at time t=tl. Gensrall~-, in de~ermin_ng
the probabilit~ of bein~ at a subject state at a given
time, .he inven~isrn ~a) recogni7es each previollâ sta-e
that has a rransition which leads co the subject slate
anci the respec ive probability of each such previous
state; (b) recognizes, for each such pre~ious s~ate, a
value rep.esentirlg ~he prosâbilit~7 of the label that
must be genera~ed at the trar.sition berween each such
previous state and .he current state in order to con.orm
the label slring; and (c~ combines the probability
of each pre~ious slate and the resoecrive vdlue r~pres-
enting the label probaDilic~; to provids a subjecr. slare
1~ probability over a correspollaing transition, rhe overall
probability of beino at che subiec~ sr.a~e beinO deter-
mined fxom the subject stare probLbilities ove. all
transitions leading thereto. The calculation ror state
S7, it is noted, includes lerms relating to he three
O null transitions which permit the phorle to start and end
at time t=tl with the phone ending in state S7.
As with the probabilit~ dererminations relLrive to timss
t=to and t=tl, probabili~~- determi-lations for a series
of other end times are prefe-abl~ g~nera.ed IO form an
~~ end-time distribution. The value of ths end-time dis-
tribution for a given phone provides an indica-cion of
ho~ well the given phone marches the incoming labels.
Y09~3-0~9
3L;;~3~S77
",
In determining hot.~ woll a word m2tches a st~ing of in-
coming labels, the phcnes wnich represent the ~-ord ~re
processed in seauence. Each phone generates ar. end-time
distribution of probability v21ues. a match value ror
the phone is obtained b~T summing up the end-time prcba-
bilities and then taking the loOarithm of that sum. A
starl-time distribution for the ne~-.t phone is deri~ed
b~- normaii~ing Ihe end-time dis.ribution b~-~ for e~am-
ple, scaling each value thereof b~- di~-idinc each ~alue
by the s~lm so thaL the sum of scaled ~-alues totals one.
It should be reali ed that the present in~ ntion cou-
templates two rnethods of determinirlg h~ ~he num~er of
phorles to be e.~amined for a gi~en word or word s~rlng.
III a depth first method, compu~atiorl is made along a
baseform --compu~ing a running sublo~al with each SIIC-
cessive phone. ~hen the subtotal is found to be beiow a
predefined threshold for a given phone position
~0 therealong, the computation terminates. t~lternati~el~-,
in a breadth fi st method, a compu-ation for similar
phone positions in each word is made. The CCmDUtatiOns
following the iirsL phone n each word, the second phone
in each word, and so on are made. In the breadth first
2.~ method, the computaticns along the same number of phones
for the various words are compared at the same relati~e
phone positions Iherealong. In either method, the
'~09~3-079
- ~,
~;~;3~77
word(s) ha~7ing the l~rgest sum or match values is ~~e
sought obie Ct .
; As noted previousl~, a language moael which stor~s in-
rormation --such as tri-grams-- relating to words in
context ma~ be included to enhance the prob2Dilit~ of a
correct ~ord selection. Langua~,e modsls have bee~ re-
ported in the iiterature.
'0 ~le detailed match has been impiemenled in .~PAL ~;rray
Processor Assembl~j Language) which is the native assem-
bler for the Floating Poin~ S~-ste~.s, Inc. l90L. In his
regard, it should De recognized ~hat the ie ailed m.atch
reauires considera~le memor~ for stor_ng each o, the
actual label probabilities (i.e., the probabilit-j- th~t
a given phone generales a gi~-en label ~- at a given
transition); the transition probaDilities for eacn phone
maohine; and the probaDilities of a given phone being
at a given state at a gi~en ~ime after a defined start
time. '~e above-r.oted i90L is set up to ma~e the .arious
computations of elld times, match values based on ,for
example, a sum --prererabl~- the l-,garithm of the sum of
end time probabilities; start times based on the prsvi-
ousl~ generared end time probabilities; and ~orc match
2~ scores based on the match values for sequential phones
in a ~ord.
'~09&3-~ 7 ~
- 7 6- ~ ~3~57~
Because the detailed match is c~mputationally eY~pensive~
the present in~-ention includes a basic fest mz~ch and
o an aiternative fas~ match which reduces the compula ion
requirements withou~ sacrificing accuracy.
The fast ma~ch embodiments are employed to define a list
of on the order of tan ,o one hundred candida~e words
selected as the mos~ ely words in he vocabulary to
lG correspond to the incoming labels. The candida e words
are preferably subjected to the lan~ua~e model and to
the detailed match. By paring the number of ~-ords con-
sidered by the detailsd match to on the order of 1~ of
the words in the vocabulary, rns compu~ational cos~ is
'5 greatl~7 reduced while accuracy is main~~inea.
C. BASIC FAST MATCH
The basic fast match simplifies the detailed match by
replacing with a single value ~he actual label prooa-
bilities for a given label at all transitions a~ which
~0 the given label may be gener~ted in a given phone ma-
chine. That is, regardless of the transition in a given
phone machine ~-hereat a label has a probability of oc-
curring, the probabilily is replaced by a single spe-
cific value. The value is preferably an overestimate,
being at least as great as the largest probabllity of
~0~83-079 ~36577
.
the label occur-ing at any transition in ~he given phone
machine.
; By setting the l~bel probabilit~- replacement value 25
the maximum of the ac;ual label probabililiss fo- the
given label in the gi~en phone machine~ it is assured
that the match v~lue generated with the basic fast~.asci~
is at least as high as the match ~-aiue th;.t would .2s~1
from employing the detailed ma~ch. In ~hls way, the
basic fast match typic~lk- o~,-eres~imates th" matCil ~-21u~
of each phone so that more words are generall selected
as candidate'words. ~ords considered candisi~es acco c-
ing tG the dstailed match also pass muste~ in accordanc~
with the basic fast match.
Referring to Fig. 4, a phone machine 40G for the basic
fast match is illustrated. Labels ~aiso referred to as
symbols and fenemes) enter the basic fast match ?hone
machine 400 together ~ith a start-time distribution. Ths
~0 start-time distribution and the label string inpu~ is
like that entering the detailed match pllone machine de-
scribed hereinabo~-e. I should be realized chat the
start time may, on occasion, not be a distribution ~ver
a piurality of tlmes but may, instead, represent a pre-
~; cise time --for example following an interval of si-
lence-- at which the phone begins. ~ihen speech is
Y0983-0~9
6~i~77
soncinuous, however, Ihe end-time dist ibution is used
to define the start-cime dis~ribution (as is discussed
o in greater delail hereinbelow). The phone macr.ine 400
generates an end-time distribution and a match v~luo for
the particular pnone from the generaced end-time dis-
tribution. The match score for a ~ord is defined as the
sum of match v~lues for co3ponent phones --~t least tne
rirst h phones in the ~;ord.
Referring no~ to Fig. ~, z diagram of a basic fast match
CompUtaliOn is illuscrated. The basic fast match com-
putaeion is only concerned ~ h the start-lime di3trib-
ution, the number --or length Ot- labcls-- prcduced by
lo the phone, and the replacement values D~ associat-d
with each label Yk. By subst tuting all aclual label
probabilities for a given label in a given phone machine
by a correspondlng replacement value~ the basic fast
match replaces transition prob bilities with length
distribution probabilities and obviat s the need for
including actual label probabilities ~hhich can difrer
for each transition in a given phone macnine) and prob-
abil ties of being at a given scate at a given ti~e.
In this regard, the length distributions are determined
2~ from the detailed match model. Specifically, for each
length in the length distribution, the invention pref-
Y0~3-0~9
3~577
erably examirles each slate individuall. and determines
for each state the various ~ransition paths b~; ~nich the
currentlv ex2mined state can occur ~a? glven a partic-
ular label length and (b) recardlsss of the out?uts
along the transitions. The probabilities for all tran-
sition paths of the particular length to each subject
s,are are summed and the sums for ~11 the subject states
are ,hen adaed to indicate the probaDilit~ of a Oi--en
len.gth in the distribution. The above procedu-e is re-
peated for each length. In accordance ~i~h the pre~erred
form of Ihe invention, these computations are made with
reference tC a trellis diagram ?S is known in ~he ~rt
1~ of ~arko~ modellinc. For transi~ion ?aths w-hich share
branches along the trellis s~ructure, the compll~aticn
for each com~on branch need be maae onl)7 once and is
applied to each path that includes the common branch.
In the diagram oE Fig. ~, th-o limita~ ons are inciuded
bv hay of example. First, it is assumed that the length
of labels produced b~- the phone can be zero, one, tho,
or three having respective probabilities of lo~
and 13. The start time is also li~.ited, permi .ir.g onl~;
four start times having respective probabilities OI q0
7~ ql, q7, and q3. With these limitations, the following
equations define the end-time distribution of a subject
phone as:
~0983-Q~9
~23~577
-30-
~o = qOlo
~1 = qllo + qO1121
_ ~ ~ q A 1 0 q 1 1 2
q~ 3 ql-~P P3 qO 3-1P~P3
q~llP4+ q212P3P4+ql 13P2P3 P4
~;=q3l2pip5+ q2l3p3p4p;
~= q313p4~
In sxâmining the equa ions, it is ooser~-ea ~hat ~3 in-
cludes a ~erm correspondinO to ~ach or rour start times.
The first term represents the probabiii~y that the phone
starts at time t=t3 And produces a length of zero l~Dels
--the phone slartin.~ ~nd ending a~ the same time. ~le
second ter~ represen~s the prob~bili y ~hat ~he pnone
starts at time t=~2, that the length or labels is one,
and that a label 3 is produced by the phone. The third
term represents the prob~bilit~; that the phone star~s
at time t=tl, that the len~th Oc labels is two ~nar..ely
labels 2 and 3), and that labels 2 and 3 are produced
by the phone. Similarlyj the fourth ~erm represents the
probability that the phone starts at time t=to; tha~ the
lenOth of labels is three; and that the three labels 1
2, and 3 are producsd by the phone.
7~ Comparing the computations recuired in the basic fast
match with those required by the detailed match su~gest
Y~3-0~ 5~7
-31-
the relative simplicit~ of the former relative to the
latter. In this regard, it s noted that the p'y ~alue
remains the s2me for each a?pearance in all ~he
equations as do the label length probabilities. `lore-
over~ with Ihe length and start time~limitations~ .he
computations for the later end times become simpler. For
e.~ample, at ~6~ the phons must start at time t=t and
lQ all three labels 4, ;, and 6 mUsl be produced o~- che
phone for that end time to ap?ly.
In generating a match value for a su~joct p~one~ the end
time prob~bilities along the deiined ond-time dislrib-
ulion are summed. If desired, the iog of t~e sllm is ta~n~n
:1~ to provide the e.YprCSSiOn:
match value - logi0(30 l --- +
As noted previously, a match score for a word is readil~
determined by summlng the malch values for successi~e
phones in a particular ~-ord.
.
: 20 In describing she genera ing of the star~ time distri~-
ution, reierence is made to Fig. ~. In Fig. 6(a), the
word THE1 is repeated, bro~en down into its compone.nt
: phones.~In Fig. 6~b), the string of labels is d~epicted
over time. In Fig. 6(c), a first start-time distribution
2; is shown. The first start-time distribution has been
.
1.
Y0')83-0~9 ~36~7~
3~
derived rrom the end-time dist~ibution of the most re-
cen. previous phone (in the previous word which mav in-
o clude a "hord" of silence). ~ased on the label inputs
and the start-time distribution of Fig. o(c', the end-
time distribu~ion for the phone Dh~ ~DH~ is generaled.
The start-time distribulion for the ne~t phone, ~H, is
determined by recognizing the time during which the
prs~.~ious pnone enc-time dis~.ibution exceeded a thresn-
old (~) in Fig. o(d). ~A) is de~ermine~ inditiGually for
each end-time distribu~ion. Preferabl~-~ (h) is a fur.c-
tion of the sum of the end-time distribution values for
a subject phone. The inter~-al bstween ~imes a and b ~hlls
1~ reprssents ~he time during which ~hs ster~-ti:ne dis-
tribution for the pnone ~'H is ss~. (See Fig. ote).) The
interval between times c and d in F~g. o~sJ corresponas
to the times between which t~.e end-~ime distribution for
the phone DH e~ceeds the threshold (A) and bet-weer which
the start-time distribution of the ne~L phone is set.
The values of the start-time dis~ribution are obtained
bv normalizing the end-time dis~ribution by, for e.~am-
ple, dividing each end-time value b~ tlle sum OI the
end-~ime values wh ch ex~eed the threshoid (A).
2~ The basic fast match phone machine 400 has been imple-
mented in a Floating Point Systems Inc. 190L with an
~.PAL program. Other hardware and softhare mav also be
o ~ ~
~LZ~S~
-33-
used to develop a specific form of Ihe invention by
following the teachings se-~ forth harein.
C. ALTER`;ATI~E FAST `IiTCH
The basic fast match employed alone or, preferably, in
conjuncrion -~;ith the detailed match and/or a lar.gua,e
model greatlv reduces computation requirements. Io -u.-
ther reduce computation21 requiremen,s, the pr~sent ir-
vention iurther simplif-es the detailed match bv
defining a uniform label length d stribution bet~een t~-o
lengths --a minimum length Lmin and ~ maximum len~t;
L . In the basic fas~ match, the ?ror;abilities of a
max
phone generating labsls cf a given leng~h --namely i~,
1~ 11, 12, etc.-- typically hav2 differing valuss. ~ccord-
ing to the alternative fast match, ~he probabilitv for
each length of labels is replaced by a single uniform
value.
Preferably, the minimum leng h is equal to the smallest
~0 length having 2 non7ero probabilit)~ in the original
length distribution, although other lengths may be se-
lected if desired. The selection oi the maximum length
is more arbitrary than the selection of the minimum
length, but is significant in that the probability of
lengths iess than the minimum and greater than the max-
~&3-0~9
7~
imum ~re set as zero. B~ defining the length r)robabilitv
to e.~ist between only th~ mir.imum ler.g~h and the;na~:imum
lenath, a uniform pseudo-dis,ribution c&n be set fo:-Lh.
In one approach, t~le uni'orm p-obabil~t~- can be set 2S
the average probabilit~ o~er the pseudv-distribution.
Alternatively, the unirorm probabilit~ can be set as the
ma~imum o. the lenOtr. probabili~ies that are replaced
bi7 the uniform ~-alue.
The effec~ of characLer,~ing all the l~bel ier~.gth ~rsv-
abilities as equal is resdil~- obse~vcd hith reference
to the equations se- forth above for the en~-~ime cis-
trib-ltion in the bssic f2st ma~ch. Specificali~ .he
1~ length probabilities can be fac~ored OUt ?S a consta!~t.
ith L i being set at zero and all leng~h ?robabilities
being replaced by a single constznt value, the end-time
distribul;ion can be characterized as:
= ~ /1 = q + ~ P
m ` m m m-l m
wllere "1" is the single uniform replacement ~alue and
where the ~alue for p corresponds prererably ~o the
replacement value for a ~iven label being gener~ted in
the given phone at time m.
~or the above equation for ~ , the malch value is defined
2~ as :
Y0983-0~9
3~ ~3~
match ~-alue = log10(~0 + ~1 + ~~ + ~m) glû ~
In comparirg the basic fast match and the alterna~i~e
fast match, it has beerl found that -he number of requi-ed
additions and multipli-ations are greatl~- reauced by
employing the alternative fas~ match phone machines.
~Tith L = û, it has been found that the basic fast
mln
malch requires fort~ multipllcaLions and ~t.~ent~ addi-
lû tions in that the leng~h probabilities must be consid-
ered. ~t~ith the alternative fast match, ~ is determined
recursively and re~uires one multiplication and one a~-
dition for each successi~Te ~ .
m
To further illuât~a~s ho~ ~he al;hrnati~e fas- m~tch
l; simplifies compu~ations, Figures , and ~ are provided.
In Fig. 7(a), a phone machine emDodimen~ 7ûO corre-
sponding to a minimum leng~h L i =û is aepic~ed. The
maximum length is assumed to be infir.i e so that ~he
length distribution may be characterized as uniform.
In Fig. 7(b), the trellis diagram resulLing -rom the
phone machine 700 is sho~n. Assuming -h3t stl,t times
af~er q are outside the start-time distribution, all
det~rminations of each successive ~ ~;ith m-n require
one addition and one multiplication. For determinations
2~ of end times thereafter, there is only one required
multiplication and no additions.
~09O3-0~9
-36- ~23~577
In Fig. 8, L i = '~ Fig. ô(a) illustrates a specific
embodiment of a phor.e machine 800 therefor and Fig. S(b
shows a corresponding trellis diagram. Because L i ='
the trellis diagra~ of Fig. 8~b) has a zero probabili ;
aiong the paths mar~ed u, ~r~ h'~ and 7. For those end
times which e~tend betheen ~4 and ~ , it is noced that
four multiplications and one addition is required. ror
end times ~reater than n+4, ons multip]ic~tion ~nd no
additions are required. This embodim nt has been imDle-
menled in AP.~L code on a FPS l~OL.
It should be noted tha~ additionai s;ates may be adaed
to the Fi~. / or Fig. o embodimAnts as c'esired in ~c-
1; cordance with the in~ention. For s~:amDlA, ar.y number
of states with null ~rarsitions ma~ be included witr~out
altering the value of L .
mln
D. IIATCHI~'G BASED 0~ FIRST J LABELS
As a iurther refinemenl to the basic fast matcn and al-
ternati~e fast match, the p~esent invention contemplates
that only the first J labels or a string which enters
a phoIIe machine be considered in the match. Assuming
that labels are produced by the acoustic processor of
an acoustic channel at the rate of one per centisecond,
~5 a reasonable ~-alue for J is one hundred. In other ~-ords,
iO'~o3-0~9
,7- ~z3~77
labels corresponding to on the order of one second of
soeech ~-ill be proiided to determine a match be.tween a
phone and the labsls entering the phone machine. B~
limiting the number of labels e.iamined, two advantages
are realized. FirsL~ decoding dela~ s reduced and,
second, problems in comparing the scores of short words
~ith long words are substantially a~oided. ~le len~th
o~ J can~ of course~ be variea as desired.
The effect of limiling the mlmber Gf labels e~amined can
be noted with reference to the trellis dia~ram of r ~.
8~b). Without the presen. ref~nement, the f2s~ match
sco~e i5 the sum o. the p-obabilities OI ~. ~S a]on~ ~he
1; bottom row of the dia~ram. That is. the ~robabi' ir, of
being at state S4 at each time sLarting a~ ~=to (for
Lmin=O) or t=t4 (for Lmin=4) is determined as a a and
all ~ '5 are then totalled. For Lmin=4, ~here is no
probability of being in state 54 at an~ ~ime bei'ore ~4.
~O With the refinemen~, the summing of a ~5 termiIlates at
time J. In Fig. 8(b), time J corresponds to time t +,.
,
Termina~ing the e~amination of J labels over J time in-
` :
tervals can result ln the following two probabllit~
summations in determining a match score. First, as de-
2; scribed hereinbefore, there is a row calculation along
the bottom row of the trellis diagram but only up to the
..
~G983-0~9
~:~3~77
, ~
time J-l. The probabilities of being in state S4 at each
time up to time J-l are summec to form a row score.
; Second, there is a column score which corresponds tO the
sum of probabilities that the phone is at each respec-
tive state S0 through S4 at time J. That is, the column
score is: 4
column score = ~ Pr~Sf,J)
f=0
The match score for a ?hone is obtained by summin~ the
roh- score and column score and then ~a~ing ~he logarithm
of that sum. To contlnue the fast m~tch for the next
phone, the values along the bottom row --prefer~bi~- in-
cluding time J-- are used to derive the next ?hone
l; start-time dislribution.
After determining a match score for each of b consec-
utive phones, the total for all phones is, as before
noted, the sum of the match scores for all the pnones.
In eY.amining the manner in which the end-time prob~bil-
ities are generated in the basic fast match and alter-
native fast match embodiments set forth abcve~ it is
noted that the determination of column scores does not
conform readily to the fast match computations. To bet-
ter adapt the refinement of limiting the number of la-
bels examined to the fast match and alternative match,
'~0~&3-0.~9
39 ~23~577
the present inven~iGn provides tnat ths column ssore be
replaced by an additional ro-~ score. That i5, an adci-
tional rot.~ score is determined ror the phone being at
state S, (in Fi~ure 8(bj) bet~;een t1mes J and Jt~ -~-nere
~ is the maximum number of states in~any phorle machine
Hence, if any phons machine has ten states, the present
refinement adds ten end times alonO the bo~tom ro~i of
the trellis for each of ~hich a probabiliby is deber-
mined. All the probabilities along the bott3m ro~ up to
and including Ihs prob~bility .?t ~ime J.~ are added to
produce a ma~ch score for the given phone. ~s before,
consecutive phone match values are summed to provide a
1~ : word match score.
This embodiment has been implementcd in ~?.~L cGde on a
FPS l90L; however as with other portions of the in-
vention may be implemented hith othe~. codcs on other
systems.
~: :
0 E. THE TREE STR~CT'~RE L~D rAST M.~TCH EM80DIMENTS
.
By employing the baalc fast match or al~ernative fas~
match --hith or without the maximum label limitatiorl--
the computatlonal time reauired in determining phone
match values is tremendously reduced. In addition, the
computational savings remain high even when the detailed
.
yo983-0~9 ~36~7~
-40-
march is performed on ~r.e-~ords in rhe fast match deri.ed
list.
Tlle phone match values~ once determined, are compared
alon~ the branches of the ~ree structure to determine
which paths o' phones are most probable. In Fig. 1, the
phone match values for DH and ~Hl should sum to a much
higher value for the spo~en ~ord "the" th1n the various
sequences of phones brancning from the phone ?1.~:. In th s
regard, it should be o~served that the ?hone match ~alue
of ~he 'irst ~1~ phone is computed only once and is ~hen
used for each baseform e;~tendir.g thererrom. In addition,
when the total score calculatetl alon~ a first seauence
l; of branches is -ound to be much lo~;er th~n a ~hrcsholc
value or much lot~er than the total score tor other se-
auences of branches, all b seforms e~tending rrom the
first sequence Day be simultaneously e~limina.ed as can-
didate words.
~0 With the fast match embodimGnts and the tree structure,
an ordered list of candidate hords is generated t~ h
greaL computat~onal sa~ings.
F. F~RTHER E~30DI!IE~TS
~ ~Q~ 3657~
Thus, r~hile the in--ention has been described ~ith ef-
erence to preferred embodiments thereof, it ~iill be un-
; derstood b~T those sl.~illed in the art .hat ~arious
changes in form and details may be made withou~ de?ar -
ing from the scope of the in~ention. ~or e.~;amDle, al-
though it is preferred that each replacement ~-alue be
no smaller than the ma.~imum actual prob~bilit~; it re-
places, ot;~er methods ma~- also be emplo~;ed to acilier~e
the desired ob~ec~ of realizing an overeslimatea ma~ch
~-alue at least mos~ OL the t me.