Patent 1246745 Summary

(12) Patent: (11) CA 1246745
(21) Application Number: 1246745
(54) English Title: MAN/MACHINE COMMUNICATIONS SYSTEM USING FORMANT BASED SPEECH ANALYSIS AND SYNTHESIS
(54) French Title: SYSTEME DE COMMUNICATION HOMME-MACHINE UTILISANT L'ANALYSE ET LA SYNTHESE DE PAROLES BASEES SUR LES FORMANTS
Status: Term Expired - Post Grant
Bibliographic Data
(51) International Patent Classification (IPC):
  • G10L 15/10 (2006.01)
  • G10L 15/01 (2013.01)
  • G10L 15/22 (2006.01)
  • G10L 17/08 (2013.01)
  • G10L 17/24 (2013.01)
  • G10L 21/003 (2013.01)
(72) Inventors:
  • HUNT, MELVYN J. (Canada)
(73) Owners:
  • MELVYN J. HUNT
(71) Applicants:
  • MELVYN J. HUNT (Canada)
(74) Agent: MEREDITH & FINLAYSON
(74) Associate agent:
(45) Issued: 1988-12-13
(22) Filed Date: 1986-03-04
Availability of licence: N/A
Dedicated to the Public: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): No

(30) Application Priority Data:
Application No. Country/Territory Date
715,443 (United States of America) 1985-03-25

Abstracts

English Abstract


MAN/MACHINE COMMUNICATIONS SYSTEM USING FORMANT BASED SPEECH
ANALYSIS AND SYNTHESIS
ABSTRACT OF THE DISCLOSURE
Formants are extracted and stored from reference speech.
Input speech is suitably processed to derive unlabelled candidate
formants. The sets of formants from the input and reference
speech are compared using dynamic programming techniques. A
further sequence comparison provides time alignment of the input
and reference speech. The sequence comparisons extract a dissim-
ilarity measure based on the formant frequencies and other
characteristics of the speech. The reference speech resulting in
the lowest dissimilarity measure identifies the input speech
recognized by the system. System feedback may be provided and is
composed of designated responsive multi-voiced speech. The
multi-voiced output speech is obtained primarily by altering the
prosodic parameters and formant frequencies of the designated
responsive speech. Thus, the designated responsive speech may,
say in an aircraft communication system, use one voice output
when providing an information response to the pilot's recognized
input speech question and another appropriately strident voice to
issue the pilot warnings. The system also may be placed in a
training mode to evaluate performance and adjust parameters.


Claims

Note: Claims are shown in the official language in which they were submitted.


The embodiments of the invention in which an exclusive property
or privilege is claimed are defined as follows:
1. A speech recognition system comprising:
a) means for extracting and storing from a reference
speech vocabulary comprised of a plurality of vocabulary items,
on a frame by frame basis, a set of formant parameters comprising
frequencies and bandwidths, together with a measure of energy and
a measure of spectrum balance for each item of said reference
speech vocabulary;
b) means for storing vocabulary item template information
for said reference vocabulary;
c) means for storing information defining syntactically
allowed sequences of vocabulary items in speech to be recognized;
d) means for extracting, on a frame by frame basis, a set
of unlabelled potentially errorful candidate formant parameters
comprising frequencies and bandwidths, together with a measure of
energy and a measure of spectrum balance for the speech to be
recognized;
e) means for comparing sets of said unlabelled potentially
errorful candidate formant parameters with any set of said
formant parameters of said reference speech vocabulary to provide
a formant dissimilarity measure between the two sets that is not
unduly sensitive to errors present in either set;
f) means for comparing said measure of energy and said
measure of spectrum balance for the speech to be recognized with
said measure of energy, and said measure of spectrum balance for
the reference speech vocabulary to provide energy and spectrum
balance dissimilarity measures;
g) means for combining said formant dissimilarity measure
and said energy and spectrum balance dissimilarity measure to
produce local dissimilarity measures;
h) means for identifying a sequence of vocabulary item
templates by aligning the speech to be recognized with the
reference speech vocabulary which alignment results in the lowest
total dissimilarity measure, wherein the total dissimilarity
measure is the sum of local dissimilarity measures over aligned
frame pairs of the speech to be recognized and the reference
speech vocabulary; and
i) means for outputting the identified sequence of
vocabulary item templates.
2. A speaker verification system comprising:
a) means for instructing a speaker to provide speech to be
recognized corresponding to at least one of a reference speech
vocabulary comprised of a plurality of vocabulary items for all
speakers;
b) means for storing speaker identities corresponding to
the speaker's reference speech vocabulary;
c) means for extracting and storing from said reference
speech vocabulary for each speaker to be identified, on a frame
by frame basis, a set of formant parameters comprising
frequencies and bandwidths, together with a measure of energy and
a measure of spectrum balance for each frame of each item of said
reference speech vocabulary for each speaker to be identified;
d) means for storing vocabulary item template information
for said reference vocabulary for each speaker to be identified;
e) means for storing information defining syntactically
allowed sequences of vocabulary items for each speaker to be
identified;
f) means for extracting on a frame by frame basis, a set
of unlabelled potentially errorful candidate formant parameters
comprising frequencies and bandwidths, together with a measure of
energy and a measure of spectrum balance of said produced
specified sequence of vocabulary items to be recognized;
g) means for comparing sets of said unlabelled potentially
errorful candidate formant parameters with any set of the formant
parameters of said reference vocabulary to provide a formant
dissimilarity measure between the two sets that is not unduly
sensitive to the presence of errors in either set;
h) means for determining the syntactically allowed
sequence of reference speech templates and their non-linear time
alignments that minimize a local dissimilarity measure comprising
said formant dissimilarity measure, energy and spectrum balance
dissimilarities summed over aligned frame pairs of the frames of
the speech to be recognized and the frames of the reference
vocabulary;
i) means for outputting the reference vocabulary
determined by the syntactically allowed sequence of reference
speech templates;
j) means for identifying the reference speech vocabulary
by aligning the speech to be recognized with the reference speech
vocabulary which alignment results in a lowest total
dissimilarity measure, wherein the total dissimilarity measure is
the sum of the local dissimilarity measures over aligned frame
pairs of the speech to be recognized and the reference speech
vocabulary; and
k) means for outputting a positive speaker identity
corresponding to the identified reference speech vocabulary if
the total dissimilarity measure is below a predetermined
acceptable limit.
3. The system of claim 1 wherein the total dissimilarity
measure is a least cost explanation of one set in terms of the
other set, whereby when each formant parameter in the reference
speech vocabulary set is paired with said unlabelled potentially
errorful candidate formant parameters in the speech to be
recognized there is a cost that is a monotonically increasing
function of a difference in their frequencies and when an
unlabelled potentially errorful candidate formant parameter is
left unpaired there is a cost inversely related to a confidence
measure placed on that formant candidate.
4. The speaker verification system of claim 2 wherein said
total dissimilarity measure is obtained by further comparing the
formant parameter of the speech to be recognized with the formant
parameters of the reference speech vocabulary given the
determined time alignment.
5. The speaker verification system of claim 2 wherein said
total dissimilarity measure is a formant dissimilarity measure.
6. A multi-voiced output system comprising:
a) means for extracting from a reference speech vocabulary
set of natural speech, on a frame by frame basis, formant
parameters comprising frequencies and bandwidths, energy,
fundamental frequency, voiced and unvoiced decision, for each
frame of said reference speech vocabulary set;
b) first means for storing at least the formant
parameters, energy and voiced and unvoiced decision for each
frame of said reference speech vocabulary set;
c) second means for storing syntactic and prosodic rules
applicable to said reference speech vocabulary set;
d) means for selecting reference speech out of said
reference speech vocabulary set, and choosing a set of parameters
for modifying said selected reference speech;
e) means for modifying said selected reference speech in
accordance with said chosen parameters by altering one or more of
the formant parameters, energy, voiced and unvoiced decisions
stored in said first means;
f) means for synthesizing said modified reference speech
using an excitation waveform of duration and form similar to the
excitation waveform of said selected reference speech; and
g) means for suitably analog converting and outputting
said synthesized modified selected reference speech.
7. The system of claim 6 wherein:
a) said first storage means includes storage of the
fundamental frequency of each frame of said reference speech
vocabulary set; and
b) said modifying means includes altering the fundamental
frequency of said selected reference speech.
8. The system of claim 6 wherein:
a) said first storage means also includes means for
storing the bandwidth of the vocabulary set of each frame of said
reference speech; and
b) said modifying means includes means for altering said
bandwidth.
9. The system of claim 6, wherein:
a) said means for extraction includes a Laryngograph.
10. The system of claim 6, wherein:
a) said extraction means provides an error signal from a
linear predictive analysis of said reference speech vocabulary
set, said error signal being stored in said first storage means;
and
b) said synthesizing means uses said error signal as the
excitation waveform.
11. A man/machine speech communications system comprising:
a) means for extracting and storing from a reference
speech vocabulary comprised of a plurality of vocabulary items,
on a frame by frame basis, formant parameters comprising
frequencies and bandwidths, energy and spectrum balance measures
for each frame of said reference speech vocabulary, said
reference speech vocabulary being divided into a recognition
speech vocabulary and an output speech vocabulary;
b) means for storing vocabulary item template information
for said recognition speech reference vocabulary;
c) means for storing information defining syntactically
allowed sequences of vocabulary items in speech to be recognized;
d) means for extracting, on a frame by frame basis,
unlabelled potentially errorful candidate formant parameters
comprising frequencies and bandwidths, energy and spectrum
balance measures for each frame of said speech to be recognized;
e) means for comparing sets of said unlabelled potentially
errorful candidate formant parameters with any set of the
recognition speech formant parameters to provide a formant
dissimilarity measure between the two sets that is not unduly
sensitive to the presence of errors in either set;
f) means for determining the syntactically allowed
sequence of recognition speech templates and their non-linear
time alignments that minimize a total dissimilarity measure
comprising at least said formant dissimilarity measure, energy
and spectrum balance dissimilarities summed over aligned frame
pairs;
g) means for outputting a signal indicative of the
recognition speech template having the lowest total dissimilarity
measure;
h) said means for extracting and storing further including
extraction and storage of fundamental frequency, voiced and
unvoiced decision, for each frame of said output speech
vocabulary;
i) means for storage of syntactic and prosodic rules
applicable to said output speech vocabulary;
j) means for selecting a reference speech out of said
output speech vocabulary responsive to said output of a signal
indicative of recognition speech template, and means for choosing
a set of parameters for modifying said selected output speech;
k) means for modifying the characteristics of said
selected output speech in accordance with said chosen parameters
by altering one or more of said stored formant parameters, energy
or duration or form of the excitation waveform of said selected
output speech;
l) means for synthesizing said modified selected output
speech; and
m) means for suitably analog converting and outputting
said synthesized modified selected output reference speech.
12. A speech recognition method comprising the steps of:
a) extracting and storing from a reference speech
vocabulary comprised of a plurality of vocabulary items, on a
frame by frame basis, a set of formant parameters comprising
frequencies and bandwidths, a measure of energy and a measure of
spectrum balance for each frame of each item of said reference
speech vocabulary;
b) storing vocabulary item template information for said
reference speech vocabulary;
c) storing information defining allowed sequences of
vocabulary items in speech to be recognized;
d) extracting, on a frame by frame basis, unlabelled
candidate formant parameters comprising frequencies and
bandwidths, a measure of energy and a measure of spectrum balance
for the speech to be recognized;
e) comparing sets of said unlabelled candidate formant
parameters with any set of formant parameters of said reference
speech vocabulary to provide a formant dissimilarity measure
between the two sets that is not unduly sensitive to the presence
of errors in either set;
f) comparing energy and spectrum balance measures for the
speech to be recognized with the reference speech vocabulary;
g) determining the syntactically allowed sequence of
vocabulary item template information and their non-linear time
alignments with the allowed sequence of vocabulary items in the
speech to be recognized that minimize a total dissimilarity
measure, said total dissimilarity measure comprising said formant
dissimilarity measure, energy and spectrum balance
dissimilarities summed over aligned frame pairs; and
h) outputting the determined sequence of vocabulary items
corresponding to the template.
13. A speaker verification method comprising the steps of:
a) extracting and storing from a reference speech
vocabulary, for each speaker to be identified, on a frame by
frame basis, formant frequencies and bandwidths, energy and
spectrum balance;
b) storing whole-word template information for said
reference vocabulary;
c) storing information defining sequences of words in the
reference speech vocabulary;
d) instructing a speaker to say a specified sequence of
words and to identify himself or herself;
e) extracting on a frame by frame basis unlabelled
candidate formant frequencies and bandwidths, energy and spectrum
balance of the speaker's words;
f) comparing sets of unlabelled candidate formant
frequencies and bandwidths with the formant frequencies and
bandwidths of the reference speech for the identified speaker to
provide a formant dissimilarity measure between the two sets that
is not unduly sensitive to the presence of errors in either set;
g) comparing sets of the energy and spectrum balance to
provide a further dissimilarity measure which is combined with
the formant dissimilarity measure to provide a total
dissimilarity measure;
h) determining the time alignment of the specified
sequence of words with the reference speech templates
corresponding to the speaker's claimed identity that minimizes
the total summed formant dissimilarity measure over aligned frame
pairs; and
i) measuring the equivalence between the time aligned
specified sequence of words and the reference speech templates
and determining whether the equivalence is above an acceptable
lower limit for speaker verification.
14. A method of providing a multi-voiced output comprising
the steps of:
a) extracting from a reference speech vocabulary, on a
frame by frame basis, formant parameters comprising frequencies
and bandwidths, energy, fundamental frequency, voiced and
unvoiced decision, for each of said reference speech vocabulary;
b) storing in a first means at least said formant
parameters, energy and voiced and unvoiced decision for each of
said reference speech vocabulary;
c) storing in a second means syntactic and prosodic rules
applicable to said reference speech vocabulary;
d) selecting reference speech out of said reference speech
vocabulary, and choosing a set of parameters for modifying said
selected reference speech;
e) modifying the characteristics of said selected
reference speech in accordance with said chosen set of parameters
by altering one or more of said stored formant parameters, energy
or duration or form of the excitation waveform of said selected
reference speech;
f) re-synthesizing said modified selected reference
speech; and
g) suitably analog converting and outputting said re-
synthesized modified selected reference speech.
15. A system of claim 1, further characterized by:
a) means for extracting and storing boundaries of
vocabulary items for the speech to be recognized from the speech
recognition system;
b) means for extracting and storing boundaries of
vocabulary items for the speech to be recognized independently of
said speech recognition system;
c) means for determining the correspondence between the
two sets of vocabulary item boundaries;
d) means for identifying and storing vocabulary item
templates of the speech to be recognized independently of said
speech recognition system;
e) means for comparing the identified sequence of
vocabulary item templates from said speech recognition system
with the corresponding independently identified and stored
vocabulary item templates within said independently extracted and
stored vocabulary item boundaries;
f) means for outputting a reliability measure of said
speech recognition system as a result of at least a portion of
the correspondence determined between the two sets of vocabulary
item boundaries and identified sequence comparison of the two
sets of vocabulary item templates.
16. The system of claim 14 further characterized by:
a) means for constraining said means for identifying a
sequence of vocabulary item templates to match said corresponding
vocabulary items identified by the independent means;
b) said means for comparing the identified sequence of
vocabulary item templates including means for passing said speech
to be recognized through a portion of said speech recognition
system at least twice.

Description

Note: Descriptions are shown in the official language in which they were submitted.


BACKGROUND OF THE INVENTION
The present invention relates to systems and
methods for man/machine communications using formant based
speech recognition. The general need for such
communications has spurred significant research in this
field. The potential applications are diverse and include,
for example, aircraft control systems, factory control
systems, office communications and information retrieval
systems. For illustrative purposes, reference will be made
to an aircraft system, where, for example, the question "air
speed?" would result in an auditory response indicating the
measured air speed. This is achieved by recognizing "air
speed" out of a reference vocabulary using an automatic
formant based speech recognition process. Once "air speed"
is recognized, it activates a machine interface which may
for example be connected to the aircraft air speed
indicator. A measurement is made and the appropriate digits
representing the air speed are output as speech responses.
As will be recognized, the two major activities involved are
automatic speech recognition and speech output.
The central process in automatic speech recognition
is the comparison of some representation of the sounds in
the speech to be recognized to those in the reference
speech. Such comparisons are also needed in other speech
applications such as speaker verification, speaker
independent and dependent speech recognition of isolated and
continuous speech. Most current speech recognition systems
use a representation of the smoothed short-term power
spectrum (such as the output of a filter bank, low-order
cepstrum coefficients, or linear prediction coefficients)
estimated every 10 ms or so, and the sound-to-sound
comparison is in terms of the overall similarity of
smoothed power spectra. Such similarity measures seem to
correlate well with human judgments of the similarity of
sounds qua sounds, but do not correlate particularly well
with human judgments of the phonetic similarity of two
sounds, which is required for speech recognition purposes.
By making a pairwise comparison of the
corresponding frequencies of the first three or four
formants (i.e. matching the formants labeled F1, F2
etc.) a similarity measure is obtained that correlates well
with the phonetic similarity of two speech sounds. It is,
however, difficult to generate reliably labeled sets of
formant frequencies in samples of speech to be recognized.
Typical errors are failure to detect the presence of a
particular formant, failure to resolve two formants that are
close together in frequency, and spurious detection of a
formant at a frequency where no formant occurs. These
errors cause a mislabeling of the remaining formants, and
rigid pairwise comparison of labeled formant frequencies is
critically sensitive to such mislabeling. It is possible to
partially avoid errors due to labeling while retaining some
of the advantages of formant-based comparisons by making
overall spectral comparisons but with weights being applied
that result in formants being emphasized.
The present invention in one aspect proposes an
alternative method which makes direct use of formant
frequencies, by using a dynamic-programming sequence
comparison technique to compare sets of putative formant
frequencies without first labeling them. This results in a
formant-based spectrum dissimilarity measure that does not
become unduly large in the presence of missing formants,
spurious formants or coalesced formant pairs.
The process of speech output in the most important
class of speech sounds, namely unnasalized voiced sounds,
can be quite accurately modeled as an all-pole filter
periodically excited by an impulse. The impulses occur as
the vocal cords snap shut, typically a hundred times a
second for a man. The frequency of the impulses, known as
the fundamental frequency, largely determines the pitch of
the voice, while the frequencies of the poles of the filter,
(formants), largely define the phonetic identity of the
speech sound. The fundamental frequency, measure of
loudness, and the pole parameters together form an extremely
compact representation of the speech signal, useful for
analysis, transmission and storage purposes.
The pole parameters can be extracted from speech
using a well established technique, linear predictive coding
(LPC), provided that the times for the occurrence of the
impulses are known. Ideally, the LPC analysis takes place
over a portion of waveform that starts at an instant of
excitation and ends before the occurrence of the next
excitation. In practice, however, these instants are not
generally known, and the LPC analysis is instead carried out
on portions of the speech waveform that are advanced at a
regular rate without reference to the excitation instants,
the portions typically being long enough to contain several
excitation points. Conventional LPC speech analysis and
transmission systems are subject to errors in the estimation
of the fundamental frequency and the decision as to whether

speech is voiced or not. It is noted that the output of an LPC
analysis of any kind is a set of predictor coefficients (or,
sometimes, another set of coefficients, such as reflection
coefficients, from which predictor coefficients can be easily
derived). By treating the coefficients as the coefficients of a
polynomial, and then solving for the roots of this predictor
polynomial, the frequencies and dampings of the poles specified
by the LPC analysis can be determined. Conventional LPC analysis
is described in the text by J.D. Markel & A.H. Gray, LINEAR
PREDICTION OF SPEECH, Springer - Verlag, Berlin, 1976.
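
As an illustration of the root-solving step just described, the following sketch (Python with NumPy; the function name and coefficient-sign convention are illustrative assumptions, not part of the patent) converts one frame's predictor coefficients into candidate formant frequencies and bandwidths:

```python
import numpy as np

def lpc_roots_to_formants(a, fs):
    """Convert LPC predictor coefficients a_1..a_p into pole frequencies
    and bandwidths. Assumes the model s[n] = sum_k a_k s[n-k] + e[n],
    so the predictor polynomial is A(z) = 1 - a_1 z^-1 - ... - a_p z^-p."""
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    roots = np.roots(poly)                    # poles of 1/A(z)
    roots = roots[np.imag(roots) > 0]         # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)   # pole frequency, Hz
    bws = -np.log(np.abs(roots)) * fs / np.pi      # 3 dB bandwidth, Hz
    order = np.argsort(freqs)
    return freqs[order], bws[order]
```

With a fifteenth-order analysis this yields up to seven complex-pole pairs, i.e. up to seven candidate formants, matching the behavior described later for the speech to be recognized.
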
A device known as a Laryngograph was developed by
Prof. Adrian Fourcin of University College London and is
commercially available. It was originally developed to provide
visual feedback of vocal cord activity - intonation and voice
quality - in the speech training of profoundly deaf subjects.
Penny-like electrodes placed on each side of the larynx serve to
measure the radio-frequency electrical impedance across the
larynx. During voiced speech this impedance shows a variation
with time depending on the area of contact of the vocal cords.
In particular, there is a rapid decrease in impedance as the
vocal cords snap shut, and if the impedance signal is
differentiated, a negative impulse is seen at the instant of
closure, i.e. at the instant of vocal-tract excitation. Analysis
of the differentiated Laryngograph signal thus provides a
virtually error-free indication of voicing, fundamental
frequency, and the vocal-tract excitation instants; this last
information allows pitch-synchronous LPC to be carried out. A

detailed description of pitch synchronous LPC is found in the
paper by A.K. Krishnamurthy, "Two Channel (Speech and EGG)
Analysis of Formant Tracking and Glottal Inverse Filtering",
Proc. IEEE Int. Conf. Acoust., Speech and Signal
Processing, San Diego, March 1984, Paper 36.6, Vol. 3.
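
A minimal sketch of extracting voicing information from a digitized Laryngograph signal, per the description above: differentiate, then pick the negative impulses as closure instants. The threshold ratio and the 2 ms refractory gap are illustrative assumptions:

```python
import numpy as np

def glottal_closures(lx, fs, threshold_ratio=0.4):
    """Estimate glottal closure instants (sample indices) and fundamental
    frequency from a Laryngograph signal. Closure appears as a rapid
    impedance drop, i.e. a negative impulse in the differentiated signal."""
    d = np.diff(lx)
    thresh = threshold_ratio * np.max(-d)     # illustrative threshold
    min_gap = int(0.002 * fs)                 # ignore impulses < 2 ms apart
    closures, last = [], -min_gap
    for n in np.where(-d > thresh)[0]:
        if n - last >= min_gap:
            closures.append(n)
            last = n
    closures = np.asarray(closures)
    f0 = fs / np.diff(closures) if len(closures) > 1 else np.array([])
    return closures, f0
```

The intervals between successive closure instants delimit glottal cycles, so the same output can drive a pitch-synchronous LPC analysis.
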
After the instant of closure, the vocal cords stay
closed for a period known as the closed-glottis phase. They then
open relatively slowly before rapidly closing again. The period
between one instant of closure and the next is known as a glottal
cycle, and the closed-glottis phase occupies between about one-
third and two-thirds of a glottal cycle. The conventional wisdom
is that pitch-synchronous analysis should be carried out over the
closed-glottis region only. The Laryngograph signal shows a
small positive-going bump at the end of the closed phase, so such
an analysis is possible.
The present invention proposes to confine analysis to
the closed phase for formant frequency extraction and inverse
filtering. However, for re-synthesizing speech, an analysis that
uses the whole glottal cycle is proposed. The foregoing is in
part because the closed-phase analysis derives the true
formant bandwidths, which differ from the effective formant
bandwidths, and so estimates the latter inaccurately.
The present invention can be used to recognize
isolated words or connected words. For connected word
recognition, i.e. recognition of input speech in terms of a
sequence of stored whole word templates, the sequence being
either completely free or limited by some simple syntactic
rules, knowledge of the location in time of the portion of the
input signal that corresponds to each word symbol output can be
utilized to evaluate the performance of the system. A dynamic
programming technique scores the performance showing the
reliability of the recognition of the input symbol string in
light of the correct word boundary positions, which may be
obtained by independent means (such as manual timing).
Accordingly the invention in one aspect pertains to a
speech recognition system including means for extracting and

storing from a reference speech vocabulary comprised of a
plurality of vocabulary items, on a frame by frame basis, a set
of formant parameters comprising frequencies and bandwidths,
together with a measure of energy and a measure of spectrum
balance for each item of the reference speech vocabulary. Means
are provided for storing vocabulary item template information for
the reference vocabulary and means are provided for storing
information defining syntactically allowed sequences of
vocabulary items in speech to be recognized. Means extract on a
frame by frame basis, a set of unlabelled potentially errorful
candidate formant parameters comprising frequencies and
bandwidths, together with a measure of energy and a measure of
spectrum balance for the speech to be recognized. The system
further includes means for comparing sets of the unlabelled
potentially errorful candidate formants parameters with any set
of the formant parameters of the reference speech vocabulary to
provide a formant dissimilarity measure between the two sets that
is not unduly sensitive to errors present in either set.
Further, means is provided for comparing the measure of energy
and the measure of spectrum balance for the speech to be
recognized with the measure of energy and the measure of spectrum
balance for the reference speech vocabulary to provide energy and
spectrum balance dissimilarity measures. Means combine the
formant dissimilarity measure and the energy and spectrum balance
dissimilarity measure to produce local dissimilarity measures and
means are provided for identifying a sequence of vocabulary item
templates by aligning the speech to be recognized with the
reference speech vocabulary which alignment results in the lowest
total dissimilarity measure, wherein the total dissimilarity
measure is the sum of local dissimilarity measures over aligned
frame pairs of the speech to be recognized and the reference
speech vocabulary. Means are provided for outputting the
identified sequence of vocabulary item templates.
The invention in another aspect pertains to a speaker
verification system including means for instructing a speaker to
provide speech to be recognized corresponding to at least one of
a reference speech vocabulary comprised of a plurality of
vocabulary items for all speakers. Means store speaker

identities corresponding to the speaker's reference speech
vocabulary and means are provided for extracting and storing from
the reference speech vocabulary for each speaker to be
identified, on a frame by frame basis, a set of formant
parameters comprising frequencies and bandwidths, together with a
measure of energy and a measure of spectrum balance for each
frame of each item of the reference speech vocabulary for each
speaker to be identified. There is means for storing vocabulary
item template information for the reference vocabulary for each
speaker to be identified and means for storing information
defining syntactically allowed sequences of vocabulary items for
each speaker to be identified. Means is provided for extracting
on a frame by frame basis, a set of unlabelled potentially
errorful candidate formant parameters comprising frequencies and
bandwidths, together with a measure of energy and a measure of
spectrum balance of the produced specified sequence of vocabulary
items to be recognized. There is also means for comparing sets
of the unlabelled potentially errorful candidate formant
parameters with any set of the formant parameters of the
reference vocabulary to provide a formant dissimilarity measure
between the two sets that is not unduly sensitive to the presence
of errors in either set. Means determine the syntactically
allowed sequence of reference speech templates and their non-
linear time alignments that minimize a local dissimilarity
measure comprising the formant dissimilarity measure, energy and
spectrum balance dissimilarities summed over aligned frame pairs
of the frames of the speech to be recognized and the frames of
the reference vocabulary. There is means for outputting the
reference vocabulary determined by the syntactically allowed
sequence of reference speech templates and means for identifying
the reference speech vocabulary by aligning the speech to be
recognized with the reference speech vocabulary which alignment
results in a lowest total dissimilarity measure, wherein the
total dissimilarity measure is the sum of the local dissimilarity
measures over aligned frame pairs of the speech to be recognized
and the reference speech vocabulary. There is also means for
outputting a positive speaker identity corresponding to the

identified reference speech vocabulary if the total dissimilarity
measure is below a predetermined acceptable limit.
A still further aspect of the invention comprehends a
multi-voiced output system comprising means for extracting from a
reference speech vocabulary set of natural speech, on a frame by
frame basis, formant parameters comprising frequencies and
bandwidths, energy, fundamental frequency, voiced and unvoiced
decision, for each frame of the reference speech vocabulary set.
First means is provided for storing at least the formant
parameters, energy and voiced and unvoiced decision for each
frame of the reference speech vocabulary set and second means
store syntactic and prosodic rules applicable to the reference
speech vocabulary set. Means is provided for selecting reference
speech out of the reference speech vocabulary set, and choosing a
set of parameters for modifying the selected reference speech.
Means modify the selected reference speech in accordance with the
chosen parameters by altering one or more of the formant
parameters, energy, voiced and unvoiced decisions stored in the
first means. There is means for synthesizing the modified
reference speech using an excitation waveform of duration and
form similar to the excitation waveform of the selected reference
speech and means suitably analog converts and outputs the
synthesized modified selected reference speech.
Other aspects and features of the invention will become
more apparent from the description of preferred embodiments of
the invention herein.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of an illustrative man/
machine communications system using formant based speech
recognition.
FIG. 2, comprised of FIGS. 2a and 2b, is illustrative
of formant extraction on an all voiced phrase using pitch-
synchronous and quasi-pitch-synchronous LPC means.

FIG. 3, comprised of FIGS. 3a, 3b, 3c and 3d, is
illustrative of formant and candidate formant extraction for
the vowel /i/.
FIG. 4, comprised of FIGS. 4a and 4b, demonstrates
scoring the recognition performance of the system for
connected words using known word boundary locations.
DETAILED DESCRIPTION
FIG. 1 shows a general block diagram of a
man/machine communication system illustrative of the
invention. The system may be used to identify a spoken
utterance and respond with an appropriate auditory response.
Components of the overall system may also be used separately
as speech recognizers, speaker verification systems, voice
training systems and multi-voiced output systems. As will
be recognized by those skilled in the art, it would
alternatively be possible to use a single general purpose
computer adapted to perform signal processing functions
described with respect to Fig. 1 for non-real-time
simulations.
DERIVATION OF FORMANT FREQUENCIES
Reference Speech
As a preliminary matter the reference speech has
the following parameters extracted and stored, namely, a
measure of: the formant frequencies, their bandwidths,
the energy, spectrum balance and if required (and available)

the fundamental frequency and/or the instant in time of
glottal closures. To achieve the foregoing, the subject
providing the reference vocabulary has the electrodes of a
Laryngograph attached to his larynx. The signal received
from the electrodes (8) is suitably low pass filtered (10)
(for example, at 5 KHz) and digitized (12). The fundamental
frequency and the instants of glottal closure are extracted
in the frame delimiter (14) when the Laryngograph signal is
available. The speech signal is simultaneously low pass
filtered (20) and digitized (22). The frame delimiter
further synchronizes the selection of frames with the
glottal closure instants. When a Laryngograph signal is not
present (as is the case when speech to be recognized is fed
into the system), the frame delimiter provides heavily
overlapped windows to the speech spectrum analyzer (24).
Speech spectrum and fundamental frequency analyzer
(24) provides a pitch synchronous LPC analysis on the
incoming speech (tenth order analysis on time-differenced
speech sampled at 10 KHz). A detailed description of pitch
synchronous LPC analysis is contained in the Krishnamurthy
reference cited earlier. The output of analyzer (24) is a
set of predictor coefficients, the spectral energy and the
spectrum balance. In addition, the fundamental frequencies
and/or glottal closure times are output. The output
predictor coefficients are passed to formant extractor (26).
The formant extractor (26) uses root-solving on the
predictor coefficients and thereby extracts the formant
frequencies. Root-solving analysis is known in the art and
is described in the earlier Markel reference. The bandwidth
of the formant frequencies are also available from the LPC
analysis. The functionality of the frame delimiter (14),

the analyzer (24), and the formant extractor (26) may be
resident in a single microprocessor, such as a Texas
Instruments TMS 320. Whole-word reference template store
(28) stores the formant frequencies and bandwidths, the
energy, and spectrum balance. The template store (28) is
composed of RAM memory and disk memory adequate to store the
reference vocabulary set. Additionally, a reference store
(30) is also provided for storing the reference vocabulary
used in the re-synthesis process. As will be recognized by
those in the art, a single storage means, appropriately
designated, may be utilized for both the whole-word
reference template store (28) and the reference store for
output (30).
Generally, a speaker will supply the reference
vocabulary speech set and by processing it through the
components described above the whole-word reference template
store (28) will be filled. Similarly, the reference store
for output (30) will also be filled by processing those
words which may be required during re-synthesis. In our
illustrative embodiment the whole-word reference template
store (28) does not retain the fundamental frequency or
voiced/unvoiced decision (provided by the analyzer (24))
while the reference store for output (30) contains these
items.
An alternative to the use of natural speech
processed as described above is to use formant based
synthetic speech. In this instance it is of course an easy
matter to extract the relevant formant information.
Any recognition system will only perform as
reliably as the reference information provided to it. Thus,
in the present invention, a test mode feeds the reference speech
into the multi-voiced output segment of the system as described
later. By listening to the recreated speech it is possible
to compare how well the system duplicated the input reference
speech, thereby giving some degree of verification of the
accuracy of the formant extraction process.
Speech to be recognized
Once the reference speech parameters are stored,
the system is ready for input speech to be recognized. The
input speech is suitably low pass filtered (20) and digitized
(22) as per the reference speech. The analyzer (24) is switched
from pitch-synchronous covariance method LPC to quasi-pitch-
synchronous covariance method LPC. In quasi-pitch-synchronous
LPC the normalized prediction error will normally be smallest
when the covariance analysis window coincides with the
closed-glottis portion of a glottal cycle. The Markel
reference, cited earlier, describes the foregoing
process. In the present invention, the analyzer (24)
carries out such an analysis on a sequence of heavily
overlapped windows provided by the frame delimiter (14) and
selects the window that gave rise to the lowest normalized
error power. For this window, the predictor polynomial is
solved and in a fashion similar to the reference speech, the
formant extractor (26) extracts the formant frequencies.
An illustrative parameter for the window length is 5 ms (40
samples at 8 KHz sampling), the window advance is 1.25 ms,
ten such overlapped windows are scanned for minimum error
power, and five more windows are then analyzed before the
most recent ten are rescanned for minimum error. The output
frame rate is thus 5 x 1.25 ms = 6.25 ms, though with some
frame-to-frame jitter. Typically a fifteenth-order analysis
is used giving rise to up to seven candidate formants. In
one preferred embodiment, six candidate formants are stored,
the formant with the lowest Q being dropped when there are
seven formants. Similar parameters as those extracted from
the reference speech, namely the energy and spectrum balance
are also extracted and stored. Though not shown in the
block diagram of Figure 1, a buffer storage may be provided
to store the parameters of each input speech frame. The
buffer may be utilized to store succeeding frames until they
are required by the template sequence matcher (34).
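
The quasi-pitch-synchronous scheme just described might be sketched as follows, using the stated parameters (5 ms windows, 1.25 ms advance, fifteenth-order covariance LPC at 8 KHz). For brevity the sketch picks the best of ten windows per output frame rather than reproducing the exact rescanning of the most recent ten; the covariance-LPC helper is a textbook formulation, and every identifier is an assumption:

```python
import numpy as np

def covariance_lpc(x, order):
    """Covariance-method LPC over one window: returns the predictor
    coefficients and the normalized prediction-error power."""
    n = len(x)
    # Rows are delayed copies x[n'-k] for k = 1..order, n' = order..n-1.
    X = np.array([x[order - k:n - k] for k in range(1, order + 1)])
    phi, c = X @ X.T, X @ x[order:n]
    # lstsq guards against the near-singular phi a very short window can give.
    a = np.linalg.lstsq(phi, c, rcond=None)[0]
    err = x[order:n] - a @ X
    return a, float(err @ err) / float(x[order:n] @ x[order:n])

def quasi_pitch_sync_frame(speech, start, order=15,
                           win_len=40, advance=10, n_windows=10):
    """Scan heavily overlapped windows (40 samples = 5 ms at 8 KHz,
    advanced by 10 samples = 1.25 ms) and keep the minimum-error window,
    which will normally coincide with the closed-glottis portion."""
    best = None
    for k in range(n_windows):
        seg = np.asarray(speech[start + k * advance:
                                start + k * advance + win_len], dtype=float)
        a, nerr = covariance_lpc(seg, order)
        if best is None or nerr < best[0]:
            best = (nerr, a)
    return best[1]   # coefficients to pass to root-solving, as sketched earlier
```
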
Sample tests on speech indicate that, in non-nasal
voiced sounds, the frequencies and bandwidths are extracted
reliably using the quasi-pitch-synchronous LPC means as
described above. Figure 2 shows sample outputs utilizing
both pitch-synchronous and quasi-pitch-synchronous LPC on
the phrase "we were away a year ago".
An alternative to the foregoing quasi-pitch
synchronous LPC method would be to use the Laryngograph-
aided pitch-synchronous extractor described earlier. This
alternative is only practical where the subject is prepared
to wear a Laryngograph. As can be seen in Figure 2, the two
analyses provide similar results, though the pitch
synchronous method provides better continuity in the formant
extraction of F4 and F5.
A further alternative, to the overlapped-window
formant extraction, is a recursive lattice formulation
method as described in John M. Turner's article, "Overview
of Recursive Least Square Estimates and Lattice Filters",
Technical Report M-736-2, Department of Electrical
Engineering, Stanford University, USA. Another alternative
is to derive candidate formant frequencies from a non-LPC-
based representation. It should also be noted that formant
extractor (26) can also utilize a peak-picking method to
extract formants rather than the root solving method
outlined earlier.
FORMANT COMPARISON
The starting point for the comparison of formants
is two sets of formant frequencies (typically five or more
in each set) obtained from the reference and speech samples.
The aim is to find the best correspondence between the two
sets given the kinds of errors described earlier such as the
spurious detection of a formant which may be present in one
or both sets. Figure 3 illustrates the formant comparison
problem for formant based speech recognition. Figure 3a)
provides the typical output of a pitch-synchronous LPC
analysis on the vowel /i/. The energy within the speech is
plotted on the y axis on a logarithmic scale, against the
frequency which appears on the x axis. As the pitch-
synchronous LPC method is very (though not absolutely)
reliable we can in this instance label the peaks on the
graph as the formants. Passing the same speech signal
through another analysis might result in the output of
Figure 3b). The peaks this time are labelled as candidate
formants and the reliability of the extraction process is
not as high.
Figure 3c) is the negative second differential of
the output of Figure 3a). In this instance, the formants
are clearly highlighted and readily recognizable. The same
process has also been carried out on Figure 3b) and results
in the output shown in Figure 3d). The candidate formants
(CF) are readily visible in the graph.
Since Figures 3c) and 3d) represent the same
speech sound the formant sets of each should match if
compared. It is, however, evident that CF2 is spurious and
that formant F4 was missed. Thus, any direct comparison
would result in a mis-match and the two speech segments
would not be considered as one and the same.
The present invention however, overcomes the
foregoing by considering the formant comparison to be a
sequence comparison problem of the kind that is susceptible
to solution by dynamic programming. Given a cost of
"deleting" a member of one set, i.e. of asserting that it
corresponds to no member of the other set, and a cost of
"assigning" a member of one set to a member of the other,
i.e. of asserting that they correspond to the same formant,
one can determine the minimum-cost relationship between the
two lists. The deletion cost of a given candidate is
related to some estimate of the probability that the
candidate represents a real formant, or, alternatively, to
the probability that the corresponding formant in the other
set will not have been missed. Similarly, the assignment cost is
related to the frequency difference between the two formant

candidates being assigned.
There now follows a formal description for the
comparison of a set of formant frequencies from a frame of
reference speech to a frame of candidate formant frequencies
from some input sample speech to be recognized or analysed.
Each microprocessor running in parallel is programmed to
provide the following functionality:
An input frame consisting of M candidate formants,
labeled 1...m...M, is compared with a reference frame of N
formants, labeled 1...n...N. The dissimilarity between them
is defined as D(M,N), where D is given by the recursion
relation

             ( D(m-1, n-1) + a(m,n)   (i)
             ( D(m-1, n)   + a(m,n)   (ii)
D(m,n) = min ( D(m,   n-1) + a(m,n)   (iii)
             ( D(m-1, n)   + di(m)    (iv)
             ( D(m,   n-1) + dr(n)    (v)

for 1 <= m <= M and 1 <= n <= N, with

D(m,0) = D(m-1, 0) + di(m)
D(0,n) = D(0, n-1) + dr(n)
D(0,0) = 0

where a(m,n), the cost of assigning input candidate formant
m of frequency fi(m) to reference formant n of frequency
fr(n), is a function whose value increases monotonically
with the absolute difference |fi(m) - fr(n)|, and di(m)
and dr(n) are the costs of deleting input candidate m and
reference formant n respectively.

Cases (iv) and (v) above correspond to the deletion
of input candidate m and reference candidate n respectively.
Case (i) corresponds to the assignment of input candidate m
to reference candidate n. Cases (ii) and (iii) also
correspond to this assignment, but in these latter cases a
double assignment is implied; that is, in case (ii), for
example, reference candidate n is assigned both to input
candidates m and m-1. Such double assignment should be
permitted if it is possible for two formants of similar
frequencies to have given rise to only a single formant
candidate. If the reference set is known to be correct,
then case (iii) should not be allowed.
The foregoing process of formant comparisons
occurs within the formant set dissimilarity measure (32)
which is composed of one or more fast microprocessors, such
as the Texas Instruments TMS 320, operating in parallel. The
microprocessors are suitably programmed to implement the
relationship described above, and may be able to handle the
production of dissimilarity measures between the input
speech and approximately five words obtained from the whole-
word reference template store (28).
If all that is needed is a measure of the
similarity of the two sets, then the process can stop when
D(M,N) has been computed. This would normally be the case
if the process was being used for speech recognition. If,
however, the details of the interpretation of one set in
terms of the other are of interest, then we need a set of
back-pointers indicating for each pair (m,n) which of the
options (i) to (v) above was chosen. When (M,N) is reached,
a traceback using the pointers will reveal the optimum
interpretation. Thus, if the reference set is known to be
correct, complete and labeled, and the input set is unlabeled and
believed to be errorful, the traceback will indicate which of the
input candidates should be deleted and how the rest should be
labeled, assuming that the two sets of formants are in fact
representing the same speech sound.
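
The recursion and traceback can be sketched directly in Python; the assignment and deletion costs are supplied by the caller, since the text defines only their qualitative behavior, and the indexing follows the formal description above (m over input candidates, n over reference formants). All identifiers are illustrative:

```python
def formant_set_dissimilarity(fi, fr, di, dr, assign_cost):
    """Dynamic-programming comparison of input candidate formants fi with
    reference formants fr. di/dr are per-formant deletion costs;
    assign_cost(f_in, f_ref) is a(m,n), monotonic in |f_in - f_ref|.
    Returns D(M,N) and the back-pointer path of options (i)-(v)."""
    M, N = len(fi), len(fr)
    INF = float("inf")
    D = [[INF] * (N + 1) for _ in range(M + 1)]
    back = [[None] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0.0
    for m in range(1, M + 1):                   # boundary: delete inputs
        D[m][0], back[m][0] = D[m - 1][0] + di[m - 1], "iv"
    for n in range(1, N + 1):                   # boundary: delete references
        D[0][n], back[0][n] = D[0][n - 1] + dr[n - 1], "v"
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            a = assign_cost(fi[m - 1], fr[n - 1])
            D[m][n], back[m][n] = min(
                (D[m - 1][n - 1] + a, "i"),       # assign m to n
                (D[m - 1][n] + a, "ii"),          # double assignment
                (D[m][n - 1] + a, "iii"),         # double assignment
                (D[m - 1][n] + di[m - 1], "iv"),  # delete input m
                (D[m][n - 1] + dr[n - 1], "v"))   # delete reference n
    path, m, n = [], M, N                       # traceback for labeling
    while m or n:
        op = back[m][n]
        path.append((m, n, op))
        if op == "i":
            m, n = m - 1, n - 1
        elif op in ("ii", "iv"):
            m -= 1
        else:
            n -= 1
    return D[M][N], list(reversed(path))
```

If the reference set is known to be correct, option (iii) is simply omitted from the minimization, as noted above.
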
The dissimilarity measure D(M,N) can be used in place of
the usual spectrum difference measures in dynamic-programming
time alignment or in hidden Markov methods that use continuous
parameter sets. Hidden Markov methods are, for example, described
in the abstract by B.H. Juang, S.E. Levinson, L.R. Rabiner and
M.M. Sondhi, "Continuous Density Hidden Markov Models for
Speaker-Independent Recognition of Isolated Digits", J. Acoust.
Soc. Am., Suppl. 1, Vol. 76, U1, Fall 1984.
The formant-to-formant assignment cost, a(m,n), should
be a monotonically increasing function of the absolute difference
in the two formant frequencies. In one of the embodiments
it is proposed as being proportional to the square of
the difference in the frequencies, the frequencies being
expressed on the technical mel scale (linear to 1 KHz and
logarithmic above 1 KHz). In the input speech the
deletion costs should ideally be related to the probability
that the candidate formant corresponds to a real formant.
If a peak-picking alternative is used within the
formant extractor (26), it is reasonable to relate the
deletion penalties to the prominence of the peak. However
for the root-solving alternative described earlier, the
penalties are related to the resonance bandwidths, and as
formant bandwidths generally increase with frequency, the
deletion penalty is related linearly to the bandwidth of
each resonance divided by its frequency (i.e. to the Q of
the resonance).
In the case of the reference speech the deletion
penalty of each formant may be related to the probability of
its having been missed in the input speech. Again, for
peak-picking the prominence of the peak may be used and for
root-solving the deletion penalties may be related to the Q's
of the resonances. In addition, however, since the
determination of F4 and F5 is known to be unreliable for
input speech, (assuming quasi-pitch-synchronous LPC), the
deletion penalties on these two formants in the reference
set is reduced. Similarly, the deletion penalties on F1 in
voiceless sounds are reduced. Finally, in an attempt to
ensure that no reference frame is intrinsically easier to
match to than any other, the deletion penalties in each
reference frame are scaled such that they sum to a constant
value.
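
In outline, the costs described in the last few paragraphs might be implemented as follows; the log base of the technical mel scale above 1 KHz, the proportionality constants and the per-frame normalization target are all assumptions:

```python
import numpy as np

def tech_mel(f_hz):
    """Technical mel scale: linear to 1 kHz, logarithmic above (the log
    base used here is an assumption)."""
    f = np.asarray(f_hz, dtype=float)
    return np.where(f <= 1000.0, f, 1000.0 * (1.0 + np.log(f / 1000.0)))

def assign_cost(f_input, f_ref):
    """a(m,n): proportional to the squared mel-scale frequency difference."""
    return float(tech_mel(f_input) - tech_mel(f_ref)) ** 2

def reference_deletion_penalties(freqs, bws, frame_total=1.0):
    """Root-solving case for a reference frame: penalties related linearly
    to bandwidth/frequency, then scaled so each frame's penalties sum to
    a constant value, as described above."""
    raw = np.asarray(bws, dtype=float) / np.asarray(freqs, dtype=float)
    return frame_total * raw / raw.sum()
```

Reductions such as the lower penalties on F4 and F5, or on F1 in voiceless sounds, would be applied to `raw` before the normalization.
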
The dissimilarity measure, D, provides an
indication of the differences between two sets of formant
frequencies. It is therefore sensitive to differences in
vocal tract configuration, but it takes no account of source
differences between two speech sounds. Consequently, it is
useful to supplement D with some measure of voicing and
overall loudness. One set of parameters used to represent
source behavior are the mel-scale cepstrum coefficients, C0
and C1. C0 is an indication of overall energy, while C1
indicates spectral slope or balance and thus correlates
with voicing. In the interests of computational
efficiency, alternative measures, such as the LPC C0 and C1,
could be used.
In the present invention, the energy and spectrum
balance dissimilarity measure (33) performs the foregoing
function. The C0 and C1 difference information is combined
with D. To estimate the magnitude of the three quantities,
a set of words from one speaker was aligned to a
corresponding set of reference words using conventional
spectrum matching in the time alignment, and the average
values of the three kinds of difference information for
aligned frames were determined. An initial combination of
the C0, C1 and formant information in the ratio 1:2:5 after
each had first been scaled by their average values was used
in the illustrative embodiment. The ratio may be adjusted
optionally based upon the input speech. For example,
formant comparison is not possible with certain frames that
contain no formant information (generally because they
correspond to silence), so in these cases the C0 and C1
difference information is scaled up to compensate. As will
be recognized by those in the art, it is a relatively simple
task to provide the foregoing functionality within each of
the microprocessors used for the formant dissimilarity
measure (32).
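
A sketch of that combination step follows, assuming the three difference terms and their pre-computed average values are available. The 1:2:5 ratio for C0:C1:formant is from the text; the compensation factor used for formant-free (silence) frames is an assumption:

```python
def local_dissimilarity(d_formant, d_c0, d_c1,
                        avg_formant, avg_c0, avg_c1,
                        formants_present=True):
    """Combine C0 (energy), C1 (spectrum balance) and formant differences
    in the ratio 1:2:5 after scaling each by its average value."""
    w_c0, w_c1, w_f = 1.0, 2.0, 5.0
    source_part = w_c0 * d_c0 / avg_c0 + w_c1 * d_c1 / avg_c1
    if not formants_present:
        # Frames with no formant information: scale the C0 and C1 terms
        # up to compensate (the exact factor is an assumption).
        return source_part * (w_c0 + w_c1 + w_f) / (w_c0 + w_c1)
    return source_part + w_f * d_formant / avg_formant
```
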
The template sequence matcher (34) in Figure 1 is
composed of a dedicated microprocessor which obtains
information from the syntax constraints store (36). The
template sequence matcher (34) also obtains the frames from
the formant extractor (26) or its buffer. The matcher (34) also
provides the control to reset the dissimilarity measures (32,
33). Dynamic programming techniques of comparing sets are
described in the article by J.S. Bridle, R.M. Chamberlain and
M.D. Brown, "An Algorithm for Connected Word Recognition", Proc
IEEE Int. Conf. Acoust., Speech and Signal Processing ICASSP-82,
Paris, ~rance, May 3 - 5, 1982, pp. 899 - 902.
The syntax information assists in the recognition by
constraining, for example, which words may follow other words.
Where isolated words are to be recognized the syntax may simply
prescribe that a word is bounded by silence of arbitrary
duration. The recognized word sequence output (40) relays the
indices of the words corresponding to the templates which
provided the best match based on the dissimilarity function to
the I/O port of another device. In our illustrative embodiment,
the machine (42) may be, for example, the air speed indicator.
Additionally, the recognized sequence may also be output in
visual form. As will be evident to those skilled in the art, the
system described above could be used for speech recognition
without the necessity of the additional components described
below.
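
For isolated words bounded by silence, the template matching reduces to a standard dynamic-programming time alignment that sums local dissimilarities over aligned frame pairs. The following sketch shows that reduced case only, not the connected-word algorithm of Bridle et al. cited above; the names and the choice of path moves are illustrative:

```python
def total_dissimilarity(input_frames, template_frames, local_d):
    """DTW-style alignment of input frames with one whole-word template,
    summing local dissimilarities over aligned frame pairs."""
    I, J = len(input_frames), len(template_frames)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d = local_d(input_frames[i - 1], template_frames[j - 1])
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[I][J]

# Recognition then reports the template with the lowest total measure:
# best_word = min(templates, key=lambda t: total_dissimilarity(frames, t, ld))
```
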
Isolated-word speaker-independent digit-recognition
tests were run using formant-based templates on the foregoing
part of the system as described herein. Recognition
performance on the 1500-word set taken from
the ten male speakers was comparable with the best
results obtained with conventional spectral matching and spectral
adaptation. Recognition performance on the female speakers,

however, was less effective, in part because the quasi-pitch-
synchronous LPC analysis window used was longer than one
glottal cycle for typical female speech. The extraction of
formant candidates would thus be less successful in that
instance.
In both the isolated-word and the connected-word
tests the computation time taken for the formant matching
could be significantly reduced without degrading performance
significantly by excluding the higher-frequency formants,
which are known to have little influence on human phonetic
judgements.
APPLICATIONS
As noted earlier, if the reference formants are
labeled then the formant comparison with traceback will
provide labels for those input formant candidates that were
assigned to reference formants. Such an analysis might be
used to characterize the supra-glottal behavior of a speaker
for SPEAKER VERIFICATION or for speaker adaptation in speech
recognition.
Of course, for the labeling to be error-free the
formant patterns of the reference speech and the speech to
be analyzed have to be reasonably similar. Analysis of
naturally produced speech using reference speech taken from
the same talker, the reference formants being derived from a
pitch-synchronous analysis, can thus be used for speaker
verification or alternatively, for the analysis of possible
changes in a subject's formant patterns under, for example,
high gravitational stress. This is done by storing a
reference vocabulary from each of the potential speakers to
be recognized. Once the speaker identifies himself, his
reference vocabulary set is used for comparison. The
speaker is asked to voice a sequence of one or more words, and
if an acceptable level of equivalence is found between the
speaker's speech and the reference vocabulary, the speaker's
identity can be confirmed.
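
By way of illustration only, the verification decision just
described might be sketched as follows. This is a minimal sketch in
Python; the names used (verify_speaker, references, threshold) are
assumptions for the illustration, and the dissimilarity function
stands in for the formant-based measure described earlier.

    def verify_speaker(claimed_id, utterance, references, dissimilarity, threshold):
        # `references` is assumed to map each enrolled speaker to a list
        # of stored reference templates; `dissimilarity` stands in for
        # the formant-based measure described earlier.
        best = min(dissimilarity(utterance, ref) for ref in references[claimed_id])
        # Accept the claimed identity only if an acceptable level of
        # equivalence (a score below the threshold) is found.
        return best <= threshold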
It should be noted that formant frequencies are a
good basis for predicting human phonetic judgments. Intra-
speaker variation due to physical or emotional stress tends
to affect the source spectrum rather than formant
frequencies, and signal degradations introduced by speech
transmission systems largely preserve formant frequencies -
if they did not, they simply would not work. There is thus
a strong motivation for seeking an error-resistant way of
using formant frequency information in speaker verification
systems as is proposed above. One further variation can
consist of aligning the input speech with the speaker's reference
templates (the alignment requiring all the parameters,
including C and C1, as described for recognition) and then
using only a formant comparison to determine the measure of
equivalence. As can be seen by those skilled in the art, the
essential components for the speaker verification described
herein are available within the speech recognition portion
of the present invention.
SPEECH OUTPUT
There are two commonly used methods of carrying out
LPC analysis of speech, namely the autocorrelation and
covariance methods. The autocorrelation method is inexact,
but it is frequently preferred because of its lower
computational demands. When relatively long portions
of speech are analyzed there is little difference between
the results of the autocorrelation and covariance methods of
analysis, but the latter method occasionally leads to
unstable filters being generated. When analysis is carried
out on the shorter portions of speech delimited by the
impulses in the Laryngograph signal, however, the
autocorrelation method gives results that are significantly
different from the exact analysis provided by the covariance
method. The covariance method therefore is preferred though
it remains relatively expensive computationally and subject
to occasional instabilities. The instabilities can however
be identified and corrected at some cost in computation in
the following manner. When the roots of the predictor
polynomial are determined, those poles that cause
instabilities have negative dampings. They can be made
stable without affecting the power spectrum of the filter
specified by the LPC analysis simply by reversing the signs
of the dampings of the unstable poles and then recomputing
the predictor coefficients. In this way, a filter
specification is generated which has the same power spectrum
as the original unstable filter, but which is now stable.
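
A minimal sketch of this stabilization step, assuming the
predictor coefficients are held in a NumPy array, might read as
follows; the function name and the exact coefficient convention are
illustrative rather than prescribed by the present system.

    import numpy as np

    def stabilize_lpc(a):
        # Predictor polynomial A(z) = 1 - a1*z^-1 - ... - ap*z^-p.
        poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
        roots = np.roots(poly)  # poles of the all-pole filter 1/A(z)
        # Poles outside the unit circle correspond to negative dampings.
        unstable = np.abs(roots) > 1.0
        # Reversing the sign of a damping amounts to reflecting the pole
        # to 1/conj(r), which leaves the power spectrum of the filter
        # unchanged up to a constant gain.
        roots[unstable] = 1.0 / np.conj(roots[unstable])
        # Recompute the predictor coefficients from the corrected roots.
        stable = np.poly(roots).real
        return -stable[1:]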
An alternative to the method of formant extraction
described above is "peak picking".
This is computationally less expensive and can be used when
only formant frequencies and not their bandwidths are
required. The log power spectrum described by the predictor
coefficients can be easily computed. By differentiating
this function twice with respect to frequency, and then
picking the negative peaks in the resulting spectrum, the
formant frequencies can be estimated. By carrying out a
three-point quadratic fit to each peak of a 128-point
spectrum, the formant frequencies may be derived that show
only insignificant differences from the values obtained by
solving the predictor polynomial. For certain applications,
particularly where low bit-rate communication is necessary,
only formant frequencies would be transmitted. Speech with
fixed formant bandwidths remains intelligible when
recreated, and indeed more intelligible than conventional LPC,
though much of the naturalness is lost.
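
A sketch of the peak-picking procedure, under the same assumed
coefficient convention as above, might look as follows; the
128-point grid and the three-point quadratic fit follow the
description, while the guard against a flat spectrum is an added
detail.

    import numpy as np

    def formants_by_peak_picking(a, fs, npoints=128):
        poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
        # Log power spectrum of 1/A(z), sampled on a 128-point grid
        # covering frequencies up to fs/2.
        H = np.fft.rfft(poly, 2 * npoints)[:npoints]
        logspec = -2.0 * np.log(np.abs(H) + 1e-12)
        # Differentiate twice with respect to frequency (discretely).
        d2 = np.diff(logspec, 2)
        freqs = []
        for k in range(1, len(d2) - 1):
            # A negative peak of the second derivative marks a formant.
            if d2[k] < 0 and d2[k] < d2[k - 1] and d2[k] < d2[k + 1]:
                y0, y1, y2 = d2[k - 1], d2[k], d2[k + 1]
                denom = y0 - 2.0 * y1 + y2
                # Three-point quadratic fit to refine the peak position.
                offset = 0.5 * (y0 - y2) / denom if denom != 0.0 else 0.0
                freqs.append((k + 1 + offset) * fs / (2.0 * npoints))
        return freqs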
The stabilized pitch-synchronous covariance-method
LPC analysis leads to re-synthesized speech of much greater
naturalness and intelligibility than conventional pitch-
asynchronous LPC. Although the computationally expensive
analysis and the need to use a Laryngograph make the method
unattractive for real-time communication systems, it is
ideally suited for speech output applications such as
auditory feedback from the speech recognizer described
herein.
The accurate formant analysis provided by the
speech synthesizer portion of the system described herein
leads to a much more efficient coding scheme than is
possible with current LPC systems. For example, most LPC
systems use ten predictor coefficients, equivalent to five
formants. There is however little loss in quality of the
re-synthesized speech in assigning fixed values to the
highest two formants, leading to a specification in terms of
only six coefficients. Similarly, the human ear is
enormously more sensitive to formant frequencies than to
formant bandwidths, so an efficient bit-assignment strategy
devotes more bits to the specification of frequencies than
bandwidths. By contrast, when predictor coefficients are
coded directly, all coefficients are specified with equal
precision, so that formant frequencies and bandwidths are
also specified with equal precision.
When sets of formants are derived by pitch-
synchronous analysis, it sometimes happens that for a given
glottal cycle one or more formants are missed. For re-
synthesis purposes, such errors are perceptually
unimportant, but for other tasks, such as the analysis of
reference speech or the interpolation between two
utterances, the absence of one formant will lead to serious
errors caused by the mislabeling of the remaining formants.
In the present system, the continuity of formant tracks over
time is exploited using the robust formant comparison
described earlier to determine which formants are missing so
that they can be replaced by corresponding formants in
adjacent glottal cycles.
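
A sketch of this repair step: `frames` is assumed to be a list of
per-glottal-cycle formant lists, with None marking a formant that
the comparison has identified as missing (the data layout is an
assumption made for the illustration).

    def fill_missing_formants(frames):
        for i, cycle in enumerate(frames):
            for j, f in enumerate(cycle):
                if f is None:
                    # Replace a missed formant by the corresponding
                    # formant in an adjacent glottal cycle, where one
                    # is available.
                    for k in (i - 1, i + 1):
                        if 0 <= k < len(frames) and frames[k][j] is not None:
                            cycle[j] = frames[k][j]
                            break
        return frames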
The control available with the re-synthesis can be
exploited in the concatenation of words recorded in
isolation. Appropriate intonation patterns, durations and
energy contours can be imposed on the words to be
concatenated into a particular phrase. Emphasis can be
given to words that require it (such as in the dialogue:
"did you say one five two?", "no, one nine two"), and
further appropriate coarticulations may be generated by
smoothing formant tracks across word boundaries.
Reverting to the illustrative embodiment for the
speech re-synthesis, a signal is transmitted from the
machine (42) and/or the recognized word sequence output
(40). The signal provides information to the output
message constructor (46), which assembles the requisite
information for a response to the recognized word sequence.
The information assembled includes the required audio output
as selected from the reference store for output (30). As
indicated earlier, the reference store (30) contains speech
(typically words) input and analyzed earlier. The speech is
represented in the form of formants (and fundamental
frequency, etc.), as extracted by the Laryngograph and
formant extractor (26).
As is known in the art, syntax and prosody rules
are required when one or more words are to be output in
sequence to make the sequence sound "natural". The syntax
and prosodic rule store (38) provides this previously-stored
information to the message constructor (46).
The voice modifier (48) performs simple arithmetic
operations on the formant and other parameters in
accordance with the syntax and prosody rules assembled by
the message constructor (46). A re-synthesizer (50)
converts the formants as excited by the pitch information
into a sequence of words. The sequence of words is passed
through a D/A converter (52) and suitably low pass filtered
and fed into a loudspeaker. As will be recognized by those
skilled in the art, a single microprocessor (for example 48)
may be suitably programmed for the purposes of providing the
functionality required for the production of the speech
output as described above. The re-synthesizer (50) can also
be an LPC synthesizer, such as those commercially available.
The re-synthesis described herein allows the
speaker characteristics to be easily modified. This has
heretofore only been done using synthetic speech. In the
present invention however, natural speech as processed
earlier may be modified. For example, by raising the
frequencies of the lower formants by say about 15% and those
of the higher formants by some smaller amount, by doubling
the fundamental frequency, and by adding some noise to the
impulse train used as the excitation, an acceptable female
voice can be generated from a male original. Multi-voice
output of this kind is useful in many applications, for
example, when several different kinds of information
(routine information and urgent warnings, for example) are
to be communicated to a listener and must be distinguishable
from each other. Conventional LPC would require two copies
of the speech to be stored, and in any case the LPC analysis
of female (or children's) speech is particularly difficult
and generally quite unsuccessful. The multi-voiced output may
also be used by
providing suitable variations in the prosodic characteristics
in combination with modifying the formant frequencies to provide
other voices, such as those resembling cartoon
characters.
Other examples of multi-voiced conversion of a typical
English-speaking Canadian male voice include:
(a) fixing the pitch at 90 Hz and deriving a "robotic"
voice;
(b) lowering the fundamental by 30% and the formant
frequencies by 10% to make a particularly large male;
(c) raising or lowering F1 by 35% to provide a male
speaking with a different accent and voice quality.
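
These voice conversions amount to simple arithmetic on the stored
parameters, as the following sketch suggests. The frame layout and
field names are assumptions, and the 8% applied to the higher
formants in the female conversion is an illustrative stand-in for
the "smaller amount" mentioned above.

    def modify_voice(frame, kind):
        # `frame` is assumed to hold a fundamental frequency "f0" and a
        # list "formants" of formant frequencies in ascending order.
        out = dict(frame)
        out["formants"] = list(frame["formants"])
        if kind == "female":
            out["f0"] = 2.0 * frame["f0"]      # double the fundamental
            out["formants"] = [f * (1.15 if i < 2 else 1.08)  # lower formants up ~15%,
                               for i, f in enumerate(out["formants"])]  # higher ones less
            out["excitation_noise"] = True     # add noise to the impulse train
        elif kind == "robotic":
            out["f0"] = 90.0                   # fix the pitch at 90 Hz
        elif kind == "large_male":
            out["f0"] = 0.7 * frame["f0"]      # fundamental down 30%
            out["formants"] = [0.9 * f for f in out["formants"]]  # formants down 10%
        elif kind == "accent":
            out["formants"][0] = 1.35 * out["formants"][0]  # raise F1 by 35%
        return out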
The method described above may be used to derive
reference material for the speaker verification system
described, which uses the robust formant-based spectrum
comparison technique.
Applications of the multi-voiced output in the area
of medical and basic speech research are numerous in light of
the ease with which features in the speech signal allow the
natural-sounding speech to be manipulated. For example,
syllable stimuli that lie between /ba/, /da/ and /ga/ may be
generated using the present system.
PERFORMANCE EVALUATION
The present invention may be used to measure
its own performance or that of another system providing
connected word recognition. In general, speech recognition
researchers need to evaluate the performance of their
systems as accurately as possible in order to determine the
relative effectiveness of alternative strategies. It is
also useful for them to know in detail the kinds of errors
that their systems make. Potential users need to evaluate
performance in order to obtain an accurate prediction of the
reliability of a system on the job. In their case,
knowledge of the details of the errors being made might
permit them to avoid confusable items in their vocabularies
or to design error-correcting protocols. In both cases, the
evaluation process is time consuming. In the case of system
development the problem is particularly severe, since the
re-use of the same recorded database many times while
developing a system can be dangerous: fresh speech should
be supplied frequently if the possibility of adapting the
system to the peculiarities of the database is to be
avoided. It is consequently highly desirable to extract as
much information as possible from each performance test.
Where isolated words are to be recognized the
evaluation process is relatively straightforward in that
single words are supplied and the response either corresponds
to the word input or does not.
In measuring the performance for connected words
the problem is different and may be expressed as follows: an
input word sequence that would be correctly represented by
the word symbol sequence α1, α2, ..., αM will cause the
recognizer to respond with the symbol sequence β1, β2, ..., βN,
where N is not necessarily equal to M, and where the beta
sequence may contain 'non-word' symbols representing, for
example, breath noise and a 'wildcard' symbol for portions
not matching any of the templates. The problem is to find a
way of comparing the beta string to the alpha string that
will provide the best indication of the recognition.
At one extreme, unless the two strings are
identical in length and content, the entire output is
considered wrong. This has the appeal of being simple and
of corresponding to the situation in most applications,
where a single error will invalidate the whole input. As a
performance measure, however, it makes rather poor use of
the data. Given a suitable statistical model of error
distributions, one can make a more accurate estimate of the
underlying probability of a string containing at least one
error from estimates of the probabilities of each symbol
being in error than one can from just knowing the proportion
of strings found to contain at least one error. The whole-
string measure provides, moreover, no information on the
kinds of errors being made.
An alternative approach is to attempt to match the
beta string to the alpha string on a symbol-by-symbol basis.
In some cases there is an obvious interpretation: for
example if "525990" is recognized as "925990", it may be
reasonable to assume that the first ~5" was erroneously
taken to be "9" and that the other digits were correctly
recognized; similarly, if the same input string was
recognized as "25990", it might be assumed that the only
error was an omission of the first digit; but if the output
string is "5259950", it is rather less clear how many errors
should be counted and what kinds of errors they were. Given
some assumptions about the relative probabilities of symbol
omission, insertion and replacement, it is possible to use
dynamic programming string matching techniques to compute
the most probable interpretation of the errors that were
made. However, such interpretations will be unreliable,
and the estimates of correct symbol interpretation obtained
in this way will tend to be too high, since the string
matching algorithm will always make the most favourable
interpretation of the output.
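
Such a match is a classical dynamic-programming string comparison;
a minimal sketch, with illustrative unit costs standing in for the
assumed relative probabilities of omission, insertion and
replacement, might be:

    def string_match_cost(alpha, beta, sub=1.0, ins=1.0, omit=1.0):
        M, N = len(alpha), len(beta)
        D = [[0.0] * (N + 1) for _ in range(M + 1)]
        for m in range(1, M + 1):
            D[m][0] = D[m - 1][0] + omit      # all alpha symbols omitted
        for n in range(1, N + 1):
            D[0][n] = D[0][n - 1] + ins       # all beta symbols inserted
        for m in range(1, M + 1):
            for n in range(1, N + 1):
                match = 0.0 if alpha[m - 1] == beta[n - 1] else sub
                D[m][n] = min(D[m - 1][n - 1] + match,  # correct / replacement
                              D[m - 1][n] + omit,       # omission of alpha[m-1]
                              D[m][n - 1] + ins)        # insertion of beta[n-1]
        return D[M][N]  # tracing back through D recovers the interpretation

On the example above, string_match_cost("525990", "5259950")
returns 1, interpreting the extra "5" as a single insertion; this
is precisely the most favourable interpretation discussed.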
In the present invention it is proposed that the
location in time of the portion of the input speech that
corresponds to each beta symbol recognized be made available
and stored. The boundary information is available in the
present system as it is, at least in principle, from other
connected word recognizers. Thus for instance, in the
foregoing example where the output string was "5259950" and
the symbol boundary information was available, the correct
positions of the beginning and end points can be aligned
(assuming of course that the correct input and its word
boundaries are known). Figure 4 shows how the alignment
process would work for the example above.
The matching of word boundaries found by the
recognition algorithm to those known to exist in the input
can be carried out automatically using dynamic programming.
A boundary in one set is assigned to a boundary in the other
set with a cost that is a monotonically increasing function
of the separation in time between the two boundary
positions, or it can be left unassigned and in this way
incur a deletion cost. The least-cost explanation of the
boundaries found can then be determined by applying the
following recursion relation:
            D(m,n) = min { D(m-1, n-1) + δ(m,n),
                           D(m-1, n)   + dα(m),
                           D(m, n-1)   + dβ(n) }

for 1 ≤ m ≤ M and 1 ≤ n ≤ N,

with D(m,0) = D(m-1, 0) + dα(m)
     D(0,n) = D(0, n-1) + dβ(n)
and  D(0,0) = 0,

where dα(m) is the cost of postulating that the recognition
algorithm found nothing to correspond to the m'th word
boundary; dβ(n) is the cost of postulating that the n'th symbol
boundary found by the recognition algorithm corresponds to
no actual word boundary; and δ(m,n) = δ(fα(m), fβ(n)) is the
assignment cost of the m'th word boundary, occurring at frame
number fα(m), to the n'th symbol boundary found by the
recognition at frame number fβ(n), where δ(fα, fβ) is a
function whose value increases monotonically with |fα - fβ|.
When the minimum-cost interpretation of all the
boundaries, D(M,N), has been determined, the set of boundary
assignments that lead to it can be traced back.
In one of the embodiments, the assignment cost
function, δ(fα, fβ), can be taken as (fα - fβ)², where fα
and fβ are frame numbers at a 6.4 ms frame rate. The
deletion costs dα and dβ are set to 1000 when the boundary
being deleted delimits a word symbol. For the alpha string
this is always the case, since the alpha symbols all
correspond to words in the vocabulary; but the beta string
may contain non-word symbols (breath noise, wildcard, etc.)
whose boundaries are considered less important and so are
given a reduced deletion cost, namely 100 units. Note that
apart from this minor detail the boundary alignment process
takes no account of the identities of the symbols between
the boundaries.
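
A sketch of the boundary alignment, implementing the recursion
above with the squared-difference assignment cost and the
per-boundary deletion costs just described (the argument layout is
illustrative):

    import numpy as np

    def align_boundaries(fa, fb, da, db):
        # fa, fb: frame numbers of the M known word boundaries and the
        # N symbol boundaries found by the recognition; da, db: their
        # deletion costs (1000 for word symbols, 100 for non-word
        # symbols, per the description above).
        M, N = len(fa), len(fb)
        D = np.zeros((M + 1, N + 1))
        for m in range(1, M + 1):
            D[m, 0] = D[m - 1, 0] + da[m - 1]
        for n in range(1, N + 1):
            D[0, n] = D[0, n - 1] + db[n - 1]
        for m in range(1, M + 1):
            for n in range(1, N + 1):
                delta = (fa[m - 1] - fb[n - 1]) ** 2   # assignment cost
                D[m, n] = min(D[m - 1, n - 1] + delta,
                              D[m - 1, n] + da[m - 1],  # m'th boundary unmatched
                              D[m, n - 1] + db[n - 1])  # n'th boundary spurious
        return D  # D[M, N] is the minimum cost; trace back for assignments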

When the portions of the recognition output that
correspond most closely to the known positions of words in
the input have been determined, the corresponding symbols in
the alpha and beta strings can be compared. The following
possibilities occur:
a) An alpha symbol is matched to an identical
symbol in the beta string - correct recognition.
b) An alpha symbol is matched to a non-identical
symbol in the beta string - recognition error.
c) An alpha symbol is matched to no symbol
representing a word in the beta string - deletion error.
d) A beta symbol representing a word is matched to
no symbol in the alpha string - insertion error.
e) Two word symbols in the beta string are matched
to a single symbol in the alpha string, and one of the beta
symbols corresponds to the alpha symbol. This is the only
ambiguous condition found, and is taken to be a correct
recognition together with an insertion error.
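
A tally over the aligned symbol pairs might then be computed as in
the following sketch, where None marks an unmatched side; the
two-to-one ambiguity of case (e) is assumed to have been resolved
beforehand, and the pair representation is an assumption.

    def score_alignment(pairs):
        counts = {"correct": 0, "error": 0, "deletion": 0, "insertion": 0}
        for a, b in pairs:
            if a is not None and b is not None:
                # Case (a) or (b): matched symbols, identical or not.
                counts["correct" if a == b else "error"] += 1
            elif a is not None:
                counts["deletion"] += 1    # case (c): alpha symbol unmatched
            else:
                counts["insertion"] += 1   # case (d): beta word symbol unmatched
        return counts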
By subjecting the results to the foregoing
comparison, the present system's performance may be
evaluated and the system parameters adjusted. For example,
the recognition vocabulary may be adjusted to exclude those
words which are prone to be misinterpreted.
A unique application of the system is to utilize
the same analysis process as described above as a means for
comparing the performance of different connected-word speech
recognition systems. This is done by having the system to
be tested output the same parameters, namely, the word
boundaries and the recognized words. If these are not
available, the method is inapplicable. Using the means
described above, the performance of any recognition system
can be scored. The advantage is that when recognition
performance is poor, operators tend to dismiss the whole
output as wrong whereas the present method succeeds in
showing that 30 or 40% of the recognition was in fact
correct. A definite advantage in using the method is that
arbitrary decisions will not be subject to unconscious
biases that may occur when an evaluator is hoping that one
particular recognition scheme will prove better than
another.
The evaluation method depends on the user knowing
the positions of the boundaries between the words in the
database. They can, of course, be marked by hand, and the
effort required to do this may not be unreasonable if the
database of words is to be widely used. However, automatic
methods of labeling are also available. Moreover, if the
recognition system being used has provision for the use of a
suitable syntax, the recognizer can be constrained to
consider only the word sequence corresponding to the alpha
string, and in this way, subject perhaps to a few manual
corrections, the recognition system can be made to determine
the boundary positions that will later be used to check its
own recognition performance.
As may be noted by those skilled in the art, the
dynamic programming sequence required for evaluation as
described above is very similar to the dynamic programming
sequence utilized for formant comparison. Additionally, when
the system is being utilized in its performance mode, there is
no requirement for the formant comparators to be operating.
Thus the functionality for the performance mode may be
implemented on the same hardware as that used for the
recognition process.
The above embodiments are illustrative of the
present invention.

Representative Drawing

Sorry, the representative drawing for patent document number 1246745 was not found.

Administrative Status

2024-08-01:As part of the Next Generation Patents (NGP) transition, the Canadian Patents Database (CPD) now contains a more detailed Event History, which replicates the Event Log of our new back-office solution.

Please note that "Inactive:" events refer to event types no longer in use in our new back-office solution.


Event History

Description Date
Inactive: IPC assigned 2013-01-21
Inactive: IPC assigned 2013-01-21
Inactive: IPC assigned 2013-01-21
Inactive: IPC assigned 2013-01-21
Inactive: First IPC assigned 2013-01-21
Inactive: IPC assigned 2013-01-21
Inactive: IPC assigned 2013-01-21
Inactive: IPC expired 2013-01-01
Inactive: IPC expired 2013-01-01
Inactive: IPC expired 2013-01-01
Inactive: IPC expired 2013-01-01
Inactive: IPC expired 2013-01-01
Inactive: IPC removed 2012-12-31
Inactive: IPC removed 2012-12-31
Inactive: IPC removed 2012-12-31
Inactive: IPC removed 2012-12-31
Inactive: IPC removed 2012-12-31
Inactive: IPC deactivated 2011-07-26
Inactive: IPC deactivated 2011-07-26
Inactive: IPC from MCD 2006-03-11
Inactive: IPC from MCD 2006-03-11
Inactive: IPC from MCD 2006-03-11
Inactive: IPC from MCD 2006-03-11
Inactive: IPC from MCD 2006-03-11
Inactive: First IPC derived 2006-03-11
Grant by Issuance 1988-12-13
Inactive: Expired (old Act Patent) latest possible expiry date 1986-03-04

Abandonment History

There is no abandonment history.

Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
MELVYN J. HUNT
Past Owners on Record
None
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents


List of published and non-published patent-specific documents on the CPD.



Document Description   Date (yyyy-mm-dd)   Number of pages   Size of Image (KB)
Claims                 1993-08-19          9                 345
Cover Page             1993-08-19          1                 14
Abstract               1993-08-19          1                 28
Drawings               1993-08-19          4                 79
Descriptions           1993-08-19          36                1,177