Note: The descriptions are presented in the official language in which they were submitted.
2160184
LANGUAGE IDENTIFICATION WITH PHONOLOGICAL
AND LEXICAL MODELS
Field of the Invention
The present invention relates generally to the field of speech recognition and more particularly to the problem of spoken language identification.
Background of the Invention
Spoken language identification (LID) has been the subject of research for several years. Initially, systems were developed to screen radio transmissions and telephone conversations for the intelligence community. In the future, LID systems will become an integral part of telephone and speech-input computer networks which provide services in multiple languages. For example, a LID system can be used to pre-sort telephone callers (or computer users) into categories based on the language they speak, so that a required service may be provided in an appropriate language. Examples of such services include travel information, emergency assistance, language interpretation, telephone information and stock quotations.
Systems in the field of speech recognition generally perform their analysis of a given input speech signal based on certain linguistic models of language. These models include acoustic models which are commonly based on the fact that spoken language is comprised of a sequence of phonemes, which are the distinct, fundamental speech sounds of a given language. Phonemes may be combined into syllables, words and, ultimately, sentences.
Prior art LID systems have, in particular, been based on the acoustic properties of languages. Specifically, they have been based on the fact that the languages of the world differ from one another in their particular phoneme inventory and in the likelihood of occurrence of various sequences of these phonemes. Such systems may, for example, perform phoneme recognition based on a corresponding phoneme inventory for each of a set of candidate languages, followed by the application of a corresponding phonemotactic (phoneme sequence probabilities) model for each of the given languages
to determine the likelihood that the recognized sequence of phonemes would occur in
that language. Then, the language for which the recognized phoneme sequence is most
probable may be identified as the spoken language.
Summary of the Invention
Prior art approaches fail to take into account lexical distinctions between languages. By using language-specific lexical models (as well as phonological ones), the present invention provides a method and apparatus for LID which results in superior language discrimination capability relative to prior art systems. In particular, languages differ from each other along many dimensions, including syllable structure, prosodics, lexical words and grammar (in addition to phoneme inventory and phoneme sequences). Thus, the present invention provides a superior technique for LID which uses lexical models in addition to the phonological models used by prior art systems.
Specifically, the method of the present invention identifies a speech signal as representing speech in a given candidate language. First, the method performs acoustic speech recognition on the speech signal based on the given candidate language. This speech recognition results in the generation of one or more sequences of subwords and associated acoustic likelihood scores. The acoustic speech recognition may, for example, be based on a language-specific phoneme inventory (i.e., the subwords may be phonemes), and may apply a corresponding phonemotactic (i.e., phoneme transition probability) model for the given language to produce the associated acoustic likelihood scores (e.g., probabilities) for each of the corresponding phoneme sequences.
After the acoustic-based speech recognition has been performed, a corresponding lexical model for the given language is applied to the phoneme sequences and their associated acoustic likelihood scores. In this manner, the lexical characteristics of the given language are taken into account in order to identify the most likely phoneme sequence (assuming that the given candidate language is, in fact, the language which was spoken) and to produce a resultant likelihood score. This resultant likelihood score (of the most probable phoneme sequence) may be used as an overall language likelihood score for the given candidate language. In other words, the likelihood that the speech signal, in fact, represents speech in the given candidate language may be equated with
the likelihood that the speech signal comprises the most likely phoneme sequence (when
both acoustic and lexical language characteristics have been taken into account).
Finally, the speech signal is identified as representing speech in the given language
based on the resultant likelihood score obtained. In accordance with one illustrative
embodiment, the speech signal is analyzed in accordance with the above method with
respect to a plurality of candidate languages, and is identified as representing speech in
the candidate language which produces the highest likelihood score.
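The method just summarized can be sketched at a high level as follows. This is a minimal illustration, not the claimed implementation; the function and attribute names are hypothetical, and each candidate language is assumed to supply a phoneme recognizer and a lexical scoring model:

```python
# High-level sketch of the claimed method (hypothetical names).
# Each candidate language object supplies an acoustic/phonemotactic
# recognizer and a lexical scorer; the language whose most likely
# phoneme sequence has the best combined score is identified.

def identify_language(speech_signal, languages):
    best_language, best_score = None, float("-inf")
    for lang in languages:
        # Step 1: language-specific acoustic recognition yields candidate
        # phoneme sequences with phonological log likelihood scores.
        hypotheses = lang.recognize_phonemes(speech_signal)
        # Step 2: rescore each hypothesis with the language's lexical
        # model; the best combined score is the language likelihood score.
        score = max(
            acoustic_ll + lang.lexical_log_likelihood(phonemes)
            for phonemes, acoustic_ll in hypotheses
        )
        if score > best_score:
            best_language, best_score = lang, score
    return best_language, best_score
```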
Brief Description of the Drawings
Fig. 1 shows a prior art language identification system using phoneme recognition and phonemotactic models of phoneme sequences.
Fig. 2 shows a language identification system using both phonological and lexical
models in accordance with an illustrative embodiment of the present invention.
Detailed Description
Fig. 1 shows an example prior art language identification system using phoneme
recognition and phonemotactic models of phoneme sequences. The system shown
classifies an input speech signal (generated from a speech utterance) into one of four
candidate languages -- English, Spanish, Mandarin or German. Thus, the system comprises language subsystems 11-1 through 11-4, each for performing speech recognition in one of the four candidate languages. Specifically, language subsystem 11-1 performs English language speech recognition, language subsystem 11-2 performs Spanish language speech recognition, language subsystem 11-3 performs Mandarin language speech recognition and language subsystem 11-4 performs German language speech recognition.
Each language subsystem 11-i comprises corresponding phoneme recognizer 12-i and corresponding phonemotactics module 13-i. Thus, English language subsystem 11-1 comprises English phoneme recognizer 12-1 and English phonemotactics module 13-1, Spanish language subsystem 11-2 comprises Spanish phoneme recognizer 12-2 and Spanish phonemotactics module 13-2, Mandarin language subsystem 11-3 comprises Mandarin phoneme recognizer 12-3 and Mandarin phonemotactics module 13-3 and
German language subsystem 11-4 comprises German phoneme recognizer 12-4 and German phonemotactics module 13-4. Each language subsystem 11-i produces a corresponding log likelihood value which reflects the likelihood that the analyzed input speech signal is, in fact, speech in the given language. Finally, the system of Fig. 1 also comprises classifier 14 for classifying the input speech signal based on the log likelihood values produced by the phonemotactics modules of the language subsystems.
Phoneme recognizers 12-1 through 12-4 may, for example, each be based on conventional second order ergodic Continuous Variable Duration Hidden Markov Models (CVDHMMs). Each ergodic Hidden Markov Model (HMM) has one state per phoneme -- however, each phoneme is modeled by a time sequence of three probability distribution functions (pdfs), with each pdf representing the beginning, the middle and the end of a phoneme, respectively. Note that this structure is equivalent to a three state left-to-right hidden Markov phoneme model. The duration of each phoneme may be modeled by a four parameter gamma distribution function, where the parameters are: (1) the shortest allowed phoneme duration (the gamma distribution shift); (2) the mean duration; (3) the variance of the duration; and (4) the maximum allowed duration for the phoneme.
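The four-parameter duration model described above can be sketched as a shifted, truncated gamma density. This is an illustrative sketch only; the parameterization (converting the mean and variance of the shifted part into the usual gamma shape and scale) is an assumption, not the patent's stated formulation:

```python
import math

def duration_log_pdf(d, shift, mean, variance, d_max):
    """Log density of a shifted gamma duration model: (1) shift = shortest
    allowed duration, (2) mean, (3) variance, (4) d_max = longest allowed
    duration. Durations outside (shift, d_max] are disallowed."""
    if d <= shift or d > d_max:
        return float("-inf")
    x = d - shift                # shift the distribution to start at `shift`
    m = mean - shift             # mean of the unshifted gamma part
    k = m * m / variance         # gamma shape from mean and variance
    theta = variance / m         # gamma scale
    # Log of the gamma density x^(k-1) e^(-x/theta) / (theta^k Gamma(k))
    return ((k - 1.0) * math.log(x) - x / theta
            - k * math.log(theta) - math.lgamma(k))
```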
Different training procedures may be advantageously adopted to train the phoneme recognition systems depending on the type of transcription and the alignment of the speech waveform with the transcription which is available. For example, when the word labels and the alignment of these labels with the speech waveform are available, the phonemically segmented data may be generated automatically by obtaining the phonemic transcription and the estimated duration for each phoneme using a Text-To-Speech (TTS) system and stretching these durations linearly to cover the word duration. The phonemically segmented data thus obtained may be used to initially train the ergodic HMM models. These models may be re-trained using a conventional segmental k-means algorithm iteratively until the models converge.
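The linear stretching step can be sketched as follows. A simple illustration under the stated assumption that the TTS-estimated durations are scaled by a single factor so that they exactly cover the observed word (or sentence) duration:

```python
def stretch_durations(tts_durations, target_duration):
    """Scale TTS-estimated phoneme durations linearly so that their sum
    equals the observed word (or sentence) duration."""
    scale = target_duration / sum(tts_durations)
    return [d * scale for d in tts_durations]
```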
Alternatively, when the time-aligned phonemic transcription of the speech data is available, the initial models may be trained using this data and the models may be re-trained using the segmental k-means algorithm iteratively until the models converge. And when the sentence level transcription and segmentation is available, the phonemic
level transcription and segmentation may be obtained automatically as above, except that
the phoneme durations are stretched linearly to fit the whole sentence. The models may
then be trained iteratively as described above by using the segmented data so obtained.
Note that this method is similar to a conventional flat start k-means training procedure.
For the transition probabilities of a second order ergodic HMM, a trigram phonemotactic model may advantageously be used for phonemotactics modules 13-1 through 13-4. Such a model provides more discriminative power than the phoneme inventory and bigram probabilities, since the trigram phonemotactic captures the allowable phoneme sequences in any given language very efficiently. For example, given a set of candidate languages, it will often be the case that there are certain three phoneme sequences allowed in one of the candidate languages but not in the others.
The transition probabilities (i.e., the phonemotactics) may be trained using large amounts of labelled speech. Alternatively, in the absence of enough transcribed speech to train the transition probabilities, they may be approximated using large amounts of text (e.g., 10 million words per language, advantageously obtained from varying sources such as news wire services, newspapers and transcribed speech) and a conventional grapheme-to-phoneme convertor. Specifically, the trigram phonemotactic models may, for example, be trained by converting text to phoneme strings and then by estimating the trigram probability values by applying the following equation:
Pr(s3 | s1, s2) = λ3 f(s3 | s1, s2) + λ2 f(s3 | s2) + λ1 f(s3)     (1)
where the weights λ3, λ2 and λ1 are set to 1, 0 and 0, respectively, si is the phoneme symbol "i" and f( ) is the frequency of occurrence.
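The estimation in equation (1) can be sketched in code as follows. This is a minimal illustration (the bigram counts are used directly as the trigram-context denominator, a simplification); with the weights set to 1, 0 and 0 as stated above, the model reduces to the raw trigram relative frequency:

```python
from collections import Counter

def train_trigram_model(phoneme_strings, l3=1.0, l2=0.0, l1=0.0):
    """Estimate interpolated trigram probabilities per equation (1).
    phoneme_strings: iterable of phoneme-symbol sequences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    total = 0
    for seq in phoneme_strings:
        for i, s in enumerate(seq):
            uni[s] += 1
            total += 1
            if i >= 1:
                bi[(seq[i - 1], s)] += 1
            if i >= 2:
                tri[(seq[i - 2], seq[i - 1], s)] += 1

    def prob(s3, s1, s2):
        # Relative frequencies f(s3|s1,s2), f(s3|s2) and f(s3)
        f3 = tri[(s1, s2, s3)] / bi[(s1, s2)] if bi[(s1, s2)] else 0.0
        f2 = bi[(s2, s3)] / uni[s2] if uni[s2] else 0.0
        f1 = uni[s3] / total if total else 0.0
        return l3 * f3 + l2 * f2 + l1 * f1

    return prob
```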
Classifier 14 is used to classify the input speech signal (i.e., the speech utterance) as comprising speech in one of the given languages. Specifically, each language subsystem 11-i may advantageously be applied to a given speech utterance in parallel. Then, the language subsystem which produces the highest log likelihood value is chosen by classifier 14 as the language of the input speech signal. The log likelihood may, for example, be computed on a per frame basis to advantageously avoid the bias toward short utterances. In addition, since the phoneme set of each language may contain
different numbers of phonemes (English, for example, has 42 phonemes whereas Spanish has 27 and Mandarin has 41), the computation of the log likelihood on a frame basis helps to achieve normalization with respect to the number of phonemes.
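The per-frame normalization and classification step can be sketched as follows (names hypothetical; each language subsystem is assumed to report its total log likelihood together with the number of frames it processed):

```python
def classify_language(subsystem_scores):
    """subsystem_scores: mapping of language name to a pair
    (total_log_likelihood, frame_count). Dividing by the frame count
    normalizes the comparison, avoiding bias toward short utterances
    and toward languages with smaller phoneme sets."""
    per_frame = {
        lang: total_ll / frames
        for lang, (total_ll, frames) in subsystem_scores.items()
    }
    return max(per_frame, key=per_frame.get)
```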
The log likelihood values generated by phonemotactic modules 13-1 through 13-4 and used by classifier 14 may, for example, be computed using the well-known Bayes' rule:
P(x | Li) = P(x | Φi) P(Φi | Li)     (2)
where the Ps are conditional probabilities, x is the input speech signal, Φi is the phoneme sequence and Li is the phonemotactic model of the language i.
Fig. 2 shows a language identification system using both phonological and lexical models in accordance with an illustrative embodiment of the present invention. The illustrative system shown classifies an input speech signal (generated from a speech utterance) into one of four candidate languages -- English, Spanish, Mandarin or German -- as does the prior art system of Fig. 1. However, the system of Fig. 2 advantageously uses lexical models as well as phonological models to improve system accuracy. The illustrative system of Fig. 2 comprises language subsystems 15-1 through 15-4, each for performing speech recognition in one of the four candidate languages. Specifically, language subsystem 15-1 performs English language speech recognition, language subsystem 15-2 performs Spanish language speech recognition, language subsystem 15-3 performs Mandarin language speech recognition and language subsystem 15-4 performs German language speech recognition.
Each language subsystem 15-i comprises corresponding phoneme recognizer 12-i, corresponding phonemotactics module 13-i and corresponding lexical access module 16-i. Thus, English language subsystem 15-1 comprises English phoneme recognizer 12-1, English phonemotactics module 13-1 and English lexical access module 16-1; Spanish language subsystem 15-2 comprises Spanish phoneme recognizer 12-2, Spanish phonemotactics module 13-2 and Spanish lexical access module 16-2; Mandarin language subsystem 15-3 comprises Mandarin phoneme recognizer 12-3, Mandarin phonemotactics module 13-3 and Mandarin lexical access module 16-3; and German language subsystem 15-4 comprises German phoneme recognizer 12-4, German
phonemotactics module 13-4 and German lexical access module 16-4. As in the prior art system of Fig. 1, each language subsystem 15-i produces a corresponding log likelihood value which reflects the likelihood that the analyzed input speech signal is, in fact, speech in the given language. Finally, the system of Fig. 2 also comprises classifier 14 for classifying the input speech signal based on the log likelihood values produced by the lexical access modules of the language subsystems.
Lexical access modules 16-1 through 16-4 generate corresponding log likelihood values analogous to those generated by phonemotactic modules 13-1 through 13-4 in the prior art system of Fig. 1. However, in the case of the illustrative system of Fig. 2, these values have been based on a corresponding lexical model for the given language, as well as on the corresponding phonological model. In particular, each of the phonemotactic modules in the illustrative system of Fig. 2 yields one or more phoneme sequences along with their associated (phonological) log likelihood scores. These sequences and their associated scores are then provided to the corresponding lexical access modules for further analysis in order to determine the likelihood of each sequence in further view of a language-specific lexical model. (Note that in the case of the prior art system of Fig. 1 only the log likelihood score of the most likely phoneme sequence need be provided by the phonemotactic modules, since no further linguistic analysis is to be performed -- thus, the log likelihood score of the most likely phoneme sequence reflects the prior art system's best estimate of the likelihood that the spoken utterance was in the given candidate language.)
Specifically, for each phoneme sequence, the lexical access module of the illustrative system of Fig. 2 produces a lexical log likelihood score (as opposed to the phonological log likelihood scores produced by the phonemotactic modules) based on the likelihood that the given phoneme sequence comprises lexically "meaningful" speech. Then, the lexical log likelihood score is added to the phonological log likelihood score to produce an overall likelihood score for the given phoneme sequence (since addition of log values is equivalent to multiplication of the original values). Then, the highest of these overall likelihood scores is produced as the language likelihood score for the given language (i.e., the log likelihood score produced by the corresponding lexical access module and provided to classifier 14).
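The score combination just described can be sketched as follows (a minimal illustration with hypothetical names; the lexical scorer is assumed to be supplied by the lexical access module):

```python
def language_likelihood(hypotheses, lexical_log_likelihood):
    """hypotheses: list of (phoneme_sequence, phonological_log_likelihood)
    pairs from the phonemotactics module. Adding the lexical and
    phonological log likelihoods multiplies the underlying probabilities;
    the best combined score becomes the language likelihood score."""
    return max(
        phon_ll + lexical_log_likelihood(seq)
        for seq, phon_ll in hypotheses
    )
```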
Lexical access modules 16-1 through 16-4 may, for example, be based on the lexical model described in F. Pereira, M. Riley and R. Sproat, Weighted Rational Transductions and their Application to Human Language Processing, DARPA Workshop on Human Language Tech., Princeton, NJ, 1994. This method uses the concepts of weighted language, transduction and finite state automata from algebraic automata theory to decode cascades in speech and language processing. Lexical access can be considered as a transduction cascade, since the lexical access problem can be decomposed into a transduction, "D," from phoneme sequences to word sequences (a lexicon), and a weighted language, "M," which specifies the language model. Each of these can be represented as a finite state automaton.
The automaton for the phoneme sequence to word sequence transduction "D" may be defined in terms of word models. A word model (or lexicon) is a transducer from a subsequence of phoneme labels to a specific word. To each subsequence of phonemes, a likelihood may be assigned indicating the probability that it produced the specified word. Hence, different paths through a word model correspond to different phonetic realizations of the word, which advantageously incorporates alternative pronunciations.
The language model "M," which may be an N-gram model, may be implemented as a weighted finite state acceptor. Combining the automata implementing "D" and "M" thus results in an automaton which assigns a probability to each word sequence, and the highest probability path that the automaton estimates gives the most likely word sequence for the given speech utterance. Thus, a best sequence of words which correspond to a given speech utterance may be obtained, along with a corresponding probability therefor. The log likelihood score produced as output by lexical access modules 16-1 through 16-4 may be the logarithm of the probability so obtained.
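The effect of composing "D" and "M" can be illustrated with a small dynamic-programming sketch. This is a toy stand-in for the weighted finite-state machinery, not the referenced method itself; the lexicon structure and bigram model interface shown are hypothetical:

```python
import math

def best_word_sequence(phonemes, lexicon, lm_logprob):
    """Segment a phoneme sequence into words, maximizing the summed
    lexicon ("D") and language-model ("M") log probabilities.
    lexicon: maps phoneme tuples to lists of (word, log_prob) entries,
    so alternative pronunciations are simply multiple entries.
    lm_logprob(prev_word, word): bigram language-model log probability."""
    n = len(phonemes)
    # best[i] maps last-word -> (score, word_sequence) covering phonemes[:i]
    best = [{} for _ in range(n + 1)]
    best[0][None] = (0.0, [])
    for i in range(n):
        for prev, (score, words) in best[i].items():
            for j in range(i + 1, n + 1):
                for word, lex_lp in lexicon.get(tuple(phonemes[i:j]), []):
                    cand = score + lex_lp + lm_logprob(prev, word)
                    if cand > best[j].get(word, (float("-inf"),))[0]:
                        best[j][word] = (cand, words + [word])
    if not best[n]:
        return None, float("-inf")
    score, words = max(best[n].values())
    return words, score
```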
The transducer "D" (lexicon or word model) and the acceptor "M" (language model) may advantageously be built using a large sample (e.g., 10,000 words per language) obtained from a commercially available (or otherwise generally available) multi-language transcribed speech data base, such as the data base compiled by Oregon Graduate Institute and described in Y. K. Muthusamy, R. A. Cole and B. T. Oshika, The OGI Multi-Language Telephone Speech Corpus, Proc. of ICSLP 92, Banff, Canada,
1992. The lexicon for each language advantageously comprises a large number of words (e.g., 2000 unique words), which includes the most frequently used words in the
language.
For clarity of explanation, the illustrative embodiment of the present invention is presented as comprising individual functional blocks. The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software. For example, the functions of the system components presented in FIG. 2 may be provided by a single shared processor or a plurality of processors. (Use of the term "processor" should not be construed to refer exclusively to hardware capable of executing software.) Illustrative embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or DSP32C, read-only memory (ROM) for storing software performing the operations discussed below, and random access memory (RAM) for storing DSP results. Very large scale integration (VLSI) hardware embodiments, as well as custom VLSI circuitry in combination with a general purpose DSP circuit, may also be provided. General purpose computer system hardware may also be used to implement LID systems in accordance with the present invention.
Although a number of specific embodiments of this invention have been shown and described herein, it is to be understood that these embodiments are merely illustrative of the many possible specific arrangements which can be devised in application of the principles of the invention. Numerous and varied other arrangements can be devised in accordance with these principles by those of ordinary skill in the art without departing from the spirit and scope of the invention.