Patent 2251502 Summary

Claims and Abstract availability

Any discrepancies in the text and image of the Claims and Abstract are due to differing posting times. Text of the Claims and Abstract are posted:

  • At the time the application is open to public inspection;
  • At the time of issue of the patent (grant).
(12) Patent Application: (11) CA 2251502
(54) English Title: DIGITAL SPEECH ACQUISITION, TRANSMISSION, STORAGE AND SEARCH SYSTEM AND METHOD
(54) French Title: SYSTEME ET METHODE D'ACQUISITION, DE TRANSMISSION, DE STOCKAGE ET DE RECHERCHE NUMERIQUES DE SIGNAUX VOCAUX
Status: Dead
Bibliographic Data
(51) International Patent Classification (IPC):
  • G10L 19/00 (2013.01)
  • G10L 21/02 (2013.01)
(72) Inventors :
  • LUTZ, MARC (Canada)
  • VANGE, MARK (Canada)
(73) Owners :
  • LANCASTER EQUITIES LIMITED (Barbados)
(71) Applicants :
  • LANCASTER EQUITIES LIMITED (Barbados)
(74) Agent: GOWLING LAFLEUR HENDERSON LLP
(74) Associate agent:
(45) Issued:
(22) Filed Date: 1998-10-26
(41) Open to Public Inspection: 2000-04-26
Availability of licence: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): No

(30) Application Priority Data: None

Abstracts

English Abstract



A digital speech system processes acquired speech to determine speech element
information, such as the phonemes in the speech, and the associated prosody
information. This determined information is then encoded for transmission
and/or storage. The encoded information is decoded to recover the speech
element information and the prosody information, which can be provided to a
speech generator to construct a facsimile of the original speech. Also, the
speech element and prosody information can be searched to locate words or
phrases of interest in the speech.


Claims

Note: Claims are shown in the official language in which they were submitted.



We claim:

1. A digital speech system, comprising:
speech element determination means operable on speech in a digital format to
determine speech element information from said speech;
speech prosody determination means operable on said speech to determine
prosody
information from said speech;
encoding means to encode speech information comprising said determined speech
element information, said determined speech prosody information and timing
information
relating thereto in a digital form;
decoding means to decode said encoded speech information to obtain said
determined
speech element information, said determined speech prosody information and
said timing
information;
comparison means to compare said determined speech element information and
said
determined speech prosody information to a database of speech elements to
select speech
sound elements which correspond thereto; and
speech generating means to assemble said selected speech sound elements to
construct
a facsimile of said speech.
2. The digital speech system of claim 1 further comprising:
speech acquisition means to acquire an analog electronic representation of
said
speech; and
digitization means to convert said analog representation of said speech to
said digital
format.
3. The digital speech system of claim 1 further comprising:
an output means to produce an output of said facsimile in a manner audible to
a user.
4. The digital speech system of claim 1 wherein said comparison means further
comprises recognition means to identify undesired speech characteristics, said
selection of speech sound elements being performed to reduce the presence of
said identified undesired speech characteristics in said facsimile.
5. The digital speech system of claim 1 further comprising search means
operable to
receive an input representing a word or phrase of interest and to examine said
determined
speech element information to locate occurrences of said word or phrase
therein.
6. A method of acquiring and constructing digital speech, comprising the steps
of:
(i) examining digitized speech to determine speech element information
relating to
said speech;
(ii) examining said digitized speech to determine prosody information relating
to said
speech;
(iii) encoding speech information corresponding to said determined speech
element
information, said determined prosody information and timing information
relating thereto;
(iv) receiving and decoding said encoded speech information to obtain said
timing
information, said determined speech element information and said determined
prosody
information;
(v) comparing said decoded speech element information and prosody information
to a
database to select corresponding speech sound elements; and
(vi) assembling said selected speech sound elements to construct a facsimile
of said
speech.
7. The method of claim 6 further comprising the steps of:
acquiring an electronic representation of speech in an analog form; and
digitizing said analog representation of speech to obtain said digitized
speech for step
(i).
8. The method of claim 6 wherein step (v) further comprises comparing said
decoded speech element information to a predefined database of undesired speech
characteristics and selecting speech sound elements which reduce said undesired
speech characteristics in said facsimile.
9. The method of claim 6 further comprising the step of receiving from a user
a word or phrase of interest and searching said determined speech element
information and said determined prosody information to locate occurrences of
said received word or phrase and to identify said locations to said user.

Description

Note: Descriptions are shown in the official language in which they were submitted.



CA 02251502 1998-10-26
DIGITAL SPEECH ACQUISITION, TRANSMISSION,
STORAGE AND SEARCH SYSTEM AND METHOD
Field Of The Invention
The present invention relates to digital speech systems and methods. More
particularly, the present invention relates to methods of acquiring,
transmitting, storing and
searching speech in a digital format.
Background Of The Invention
Much research and effort has been directed to speech recognition by computer
systems and/or to transmission of speech in digital formats. For example, IBM,
Dragonsoft
and others sell products for personal computers which allow a user to speak
computer
commands into a microphone and have the computer respond accordingly to the
spoken
commands. More advanced systems which have recently become available allow a
user to
dictate into the microphone and have the computer transcribe the spoken words
into written
text.
These systems operate by converting the analog signal produced by the
microphone into a digital signal. The computer system then analyses the
digital signals,
typically employing a technique called Hidden Markov Modeling (HMM), in an
attempt to
extract content and meaning from digitized speech. When employed with
telephony and
restricted ranges of vocabulary, recent HMM systems have been able to achieve
rates of
accuracy approaching 90%, or better.
However, HMM-based systems suffer from several disadvantages.
Specifically, HMM-based systems employ pattern matching and statistical models
to assign
words to the digitized speech. This can require significant amounts of
processing power and
such systems do not offer good accuracy when the vocabulary they must analyze
is not very
restricted. Further, these systems typically require configuration (training)
to be performed
by each user to learn the user's speech characteristics before a user can
utilize the system.
Even when trained, HMM-based systems can make errors. For example, in one test
of an
HMM-based system after a full training session, the phrase, "I am based in
Toronto" was
transcribed as "I embraced Toronto" and "Ian blazed Toronto". Further specific
training was
required before "based" could be recognized by the system.
It is also known to transmit or store voice data in digital form. For example,
a
compact disc can store a speech and telephone systems now commonly employ
digital links
over which the conversations are carried. More recently, voice conversations
have been
transmitted over digital packet networks and, it has recently become possible
to conduct
telephone conversations over the Internet. In the Internet systems, the user
speaks into a
microphone at a personal computer and the output of the microphone is
digitized, packetized
and transmitted through the Internet by the personal computer to a remote
computer where it
is received, converted back into an analog signal and output to the receiving
user.
However, digital voice storage and transmission systems also suffer from
disadvantages. One of the primary disadvantages is that, to obtain reasonable
quality voice
signals, the digitization process must have a relatively high sampling rate
(e.g. a minimum of
6,500 samples per second and preferably 8,000 or 10,000) and thus digitized
voice requires a
relatively high amount of bandwidth (e.g. about 1,500 bits per second) for
transmission. For
example, Voxware, Inc., 305 College Road East, Princeton, New Jersey, USA
sells a variety
of digital voice codecs. These codecs employ 8kHz sampling of a 4kHz speech
frequency
and require between 1260 and 6400 bps (bits per second) for transmission,
depending upon
the desired quality and robustness of the transmission.
An additional disadvantage is that, if such digital voice transmissions are to
be
stored for later access, which can be required for call logging and other
purposes, a large
amount of storage space can be required. These bandwidth and/or storage
requirements can
prevent voice transmission from being provided as a feature for Internet
multi-player games, etc.
Yet another disadvantage with prior art digital speech systems is that they do
not provide an acceptable method of searching the speech for words or phrases
of interest. At
best, the prior art systems allow "overspeed" playback of the speech to allow
a user to listen
to the speech at a fast playback rate to attempt to locate portions of
interest in the speech
which can then be played back at the correct playback rate. At worst, these
systems require
that the user listen to the speech at correct playback rates to locate
portions of interest. These
search techniques leave much to be desired, especially if a long portion of
speech is to be
searched.
It is desired to have a system and method which allows relatively efficient
acquisition, transmission, searching and storage of speech information in a
digital form.
Summary Of The Invention
It is an object of the present invention to provide a novel system and method
of acquiring, transmitting, searching and/or storing speech information which
obviates or
mitigates at least one of the above-mentioned disadvantages of the prior art.
According to a first aspect of the present invention, there is provided a
digital
speech system, comprising:
speech element determination means operable on speech in a digital format to
determine speech element information from said speech;
speech prosody determination means operable on said speech to determine
prosody information from said speech;
encoding means to encode speech information comprising said determined
speech element information, said determined speech prosody information and
timing
information relating thereto in a digital form;
decoding means to decode said encoded speech information to obtain said
determined speech element information, said determined speech prosody
information and
said timing information;
comparison means to compare said determined speech element information
and said determined speech prosody information to a database of speech
elements to select
speech sound elements which correspond thereto; and
speech generating means to assemble said selected speech sound elements to
construct a facsimile of said speech.
Preferably, the system further comprises speech acquisition means to acquire
an analog electronic representation of the speech and digitization means to
convert the analog
representation of the speech to the digital format. Also preferably, the
system further
comprises an output means to output the facsimile to a user in an audible
manner.
According to another aspect of the present invention, there is provided a
method of acquiring and constructing digital speech, comprising the steps of:
examining digitized speech to determine speech element information relating
to said speech;
examining said digitized speech to determine prosody information relating to
said speech;
encoding speech information corresponding to said determined speech element
information, said determined prosody information and timing information
relating thereto;
receiving and decoding said encoded speech information to obtain said timing
information, said determined speech element information and said determined
prosody
information;
comparing said decoded speech element information and prosody information
to a database to select corresponding speech sound elements; and
assembling said selected speech sound elements to construct a facsimile of
said speech.
Brief Description Of The Drawings
Preferred embodiments of the present invention will now be described, by way
of example only, with reference to the attached Figures, wherein:
Figure 1 shows a block diagram representation of a digital speech system in
accordance with an embodiment of the present invention;
Figure 2 shows a flowchart of a digital speech acquisition process of Figure
1;
Figure 3 shows a flowchart of the speech generation process of Figure 1;
Figure 4 shows a schematic representation of a telephone-like communication
system operating between two users connected to the Internet in accordance
with the present
invention;
Figure 5 shows a schematic representation of a call center employing an
embodiment of the present invention; and
Figure 6 shows a flowchart of a speech generation process in accordance with
another embodiment of the present invention.
Detailed Description Of The Invention
A digital speech system, in accordance with an embodiment of the present
invention, is indicated generally at 10 in Figure 1. System 10 includes a
transducer 12, such
as a microphone or other suitable means to acquire speech in electronic form,
a speech
acquisition portion 14, as described below, a speech generation portion 16,
also described
below, and a speech output 18, such as a loudspeaker or other suitable means
for presenting
the constructed speech facsimile to a user or subsequent system. Connection 20
between
speech acquisition portion 14 and speech generation portion 16 can be a
connection, such as a
data network, or can be a storage and retrieval system, such as a digital mass
storage device
which allows acquired speech to be stored and to subsequently be provided to
speech
generation portion 16.
As shown in Figure 2, speech acquisition portion 14 includes five principal
functional blocks. In block 24, portion 14 acquires, via transducer 12, speech
to be processed
and in block 28 the acquired speech is digitized at an appropriate sampling
rate. In block 32,
the digitized speech is analyzed to determine speech element information from
the acquired
speech and in block 36 prosody information is determined from the acquired
speech. As will
be apparent to those of skill in the art, the functions of blocks 32 and 36
can be performed in
a serial manner, as shown in the Figure, or can be performed in parallel if
desired. In either
case, the time information relating the determined speech elements to the
determined prosody
is part of the determined information. Once the speech element and prosody
information is
determined in blocks 32 and 36, this information is encoded by block 40 and
the resulting
encoded speech information can be transmitted and/or stored by connection 20,
as indicated
at block 44.
The acquisition and digitization of the speech at blocks 24 and 28 can be
accomplished in any suitable manner as will occur to those of skill in the art
and many
conventional techniques to accomplish this are known. For example, in the
environment of
personal computers, it is not uncommon that speech can be acquired and
digitized with an
inexpensive microphone connected to an input port on a sound card, such as a
SoundBlaster
or compatible card. It is also contemplated that the functions of blocks 24
and 28 can be
performed separately and/or at a different time. For example, the acquired
and digitized
speech can be provided to speech system 10 from a previously recorded CD ROM,
etc.
The speech elements referred to with respect to block 32 can be any
representative elements, such as allophones, diphones, phonemes, etc., as will
occur to those
of skill in the art. In a presently preferred embodiment of the invention,
phonemes are
determined from the acquired speech. Phonemes are a formal representation of
phones (the
distinct sounds in speech) and include vowels, semivowels, diphthongs and
consonants.
Different languages and/or accents have different phonemes and, depending upon
the
phoneme definition standard employed and its implementation, there are between
about forty
and about sixty phonemes in English.
The analysis of speech, via digital processing, to determine phonemes is
known and is, for example, discussed in, "Fundamentals of Speech Recognition:
A Short
Course", by Dr. Joseph Picone, Institute for Signal and Information
Processing, Mississippi
State University. This reference, the contents of which are incorporated
herein by reference,
is available at http://www.isip.msstate.edu/resources/index.html#journals.
Prosody, which is the information determined in block 36, represents pitch,
timing and other information associated with speech. An example of prosody is
the rising
pitch at the end of a sentence to represent a question. Other examples of
prosody include
pauses to represent the ends of sentences, etc. Again, the determination of
prosody
information from speech is known. The paper, "Automatic Prosodic Analysis for
Computer
Aided Pronunciation Teaching", Paul Christopher Bagshaw, available at
http://www.cstr.ed.ac.uk/pubs_by-year.html#theses provides a good description
of the
determination of prosody and the contents of this reference are also
incorporated herein by
reference.
Conventional speech element determination systems, for example phoneme
determination systems, return a matrix for all possible phonemes indicating
the probability that
each particular phoneme was present. Word recognizer systems search the
matrix for the
letter combinations that have the highest probability of being a word. Unlike
these conventional
speech element determination systems, the present invention can further employ
these
probabilities as a reference to tables of allophones, looking for sound
similarities, overlap in
sounds, and such, to assist in finding the most probable sound match, rather
than the most
probable word match. Matching on sounds, rather than words, results in a much
better
approximation of the way speech works, especially the way people mix and blend
sounds in real
speech.
Once the speech information, comprising the speech element and prosody
information, has been determined in blocks 32 and 36, it is encoded at block
40. The
information is encoded so that it can be effectively represented in a digital
form. For
example, if system 10 has sixty phonemes defined for it, for the English
language, a set of
sixty unique symbols can be defined to represent these phonemes and this
symbol set can be
represented by six bits of information. As will be apparent, six bits are much
less than the
many bytes of information which would be required for a digitized sample of
the spoken
phoneme. One encoding scheme that can be employed is to use the Unicode values
for the
International Phonetic Alphabet to represent the phonetic information and this
is the presently
preferred encoding scheme for the determined speech elements.
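As an illustrative sketch only, and not the encoding actually specified here, the six-bit arithmetic above can be shown in Python. The sixty symbol names (P00 to P59) and the pack and unpack helpers are hypothetical placeholders; a real system would use its own phoneme inventory, or the Unicode International Phonetic Alphabet values preferred above.

```python
# Hypothetical 60-symbol inventory; P00..P59 are placeholders, not real phonemes.
PHONEMES = [f"P{i:02d}" for i in range(60)]
CODE = {symbol: index for index, symbol in enumerate(PHONEMES)}
BITS = 6  # 2**6 = 64 codes, enough for sixty symbols


def pack(symbols):
    """Pack symbols into bytes, six bits each, zero-padding the last byte."""
    acc = 0
    for symbol in symbols:
        acc = (acc << BITS) | CODE[symbol]
    nbits = BITS * len(symbols)
    nbytes = (nbits + 7) // 8
    acc <<= nbytes * 8 - nbits  # left-align so the padding sits at the end
    return acc.to_bytes(nbytes, "big")


def unpack(data, count):
    """Recover `count` symbols from bytes produced by pack()."""
    acc = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << BITS) - 1
    return [
        PHONEMES[(acc >> (total_bits - BITS * (i + 1))) & mask]
        for i in range(count)
    ]
```

Under these assumptions five symbols occupy 30 bits and therefore fit in four bytes. The decoder needs the symbol count, since the final byte may contain padding bits.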
Prosody information is also encoded in a digital form, for example as numbers
representing the duration of speech elements and pitch changes. Again, the
encoding of the
determined prosody information can be accomplished with a relatively small
symbol set
and/or in a size efficient manner, compared to digitized samples of speech. As
mentioned
above, in the encoding, the time relationship between the speech elements and
prosody
information is maintained.
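One possible way to keep each speech element together with its prosody and timing is a small fixed-size record per element. The sketch below is an assumption made for illustration, not the format of the described system: the field choice (an element code, a duration in milliseconds and a signed pitch change) and the names Element, encode, decode and start_times are all hypothetical.

```python
import struct
from typing import NamedTuple


class Element(NamedTuple):
    code: int         # speech element code, 0..255 (hypothetical numbering)
    duration_ms: int  # duration of the element, 0..65535 ms
    pitch_delta: int  # signed pitch change in arbitrary steps, -128..127


# Little-endian record: unsigned byte, unsigned short, signed byte = 4 bytes.
RECORD = struct.Struct("<BHb")


def encode(elements):
    """Serialize elements in order, so the timing relationship is preserved."""
    return b"".join(RECORD.pack(*element) for element in elements)


def decode(data):
    """Rebuild the element list from bytes produced by encode()."""
    return [
        Element(*RECORD.unpack_from(data, offset))
        for offset in range(0, len(data), RECORD.size)
    ]


def start_times(elements):
    """Start time of each element in ms, derived from cumulative durations."""
    times, now = [], 0
    for element in elements:
        times.append(now)
        now += element.duration_ms
    return times
```

Because durations are stored per element, the start time of every element, and hence the alignment between speech elements and prosody, can be recomputed after decoding.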
The information encoded at block 40 is stored or transmitted, through
connection 20, at block 44.
The present invention allows a relatively low bandwidth method of
transmitting speech information and/or a relatively efficient storage method
for speech. As
discussed above, prior art systems of which the inventor is aware either
digitize speech for
transmission and/or storage and reconvert that speech to an analog form for
use, or perform a
speech to text conversion. In contrast, the present invention determines
information about the
speech, the information being stored and/or transmitted in a manner that
maintains the
information without maintaining the actual speech. Specifically, the encoded
determined
information can be employed to construct a facsimile representation of the
original speech or
to otherwise interact and access the information of the original speech.
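A rough back-of-envelope comparison can make the bandwidth and storage argument concrete. In the sketch below, the speaking rate of twelve speech elements per second and the 32 bits per encoded element are assumptions chosen purely for illustration and do not come from this description; the 1260 and 6400 bps figures are the codec rates quoted in the background above.

```python
# Assumptions for illustration only (not figures from the description).
ELEMENTS_PER_SECOND = 12   # assumed speaking rate in speech elements per second
BITS_PER_ELEMENT = 32      # assumed size of one encoded element with prosody

# Codec rates quoted in the background section.
CODEC_BPS_LOW = 1260
CODEC_BPS_HIGH = 6400


def encoded_bps(elements_per_second=ELEMENTS_PER_SECOND,
                bits_per_element=BITS_PER_ELEMENT):
    """Bit rate of the encoded speech information under the assumptions."""
    return elements_per_second * bits_per_element


def storage_bytes(seconds, bps):
    """Bytes needed to store `seconds` of speech at `bps` bits per second."""
    return seconds * bps // 8
```

Under these assumed values the encoded stream needs 384 bits per second, so one hour of speech occupies 172,800 bytes, compared with 567,000 bytes at the lowest quoted codec rate. Different assumptions about speaking rate or record size would change these numbers proportionally.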
Construction of a facsimile representation of the original speech from the
stored or transmitted encoded information can be accomplished in speech
generation portion
16, as shown in Figure 3. As illustrated, speech generation portion 16
includes four principal
functional blocks. In functional block 52 the encoded information is received
at or retrieved
by the speech generation process. In block 56, the encoded information is
decoded to obtain
the speech element information and prosody information. Next, in block 60, the
speech
information and related prosody information is compared to a database of
speech sound
elements available to the speech generator. Statistical modeling, or any other
suitable
selection process, is then employed to select appropriate available speech
sound elements. In
block 64, the selected speech sound elements are assembled in accordance with
the prosody
information and output to the user as a constructed facsimile of the original
speech. As will
be apparent to those of skill in the art, the speech sound elements can
comprise digitized
speech samples and/or synthetic speech elements, such as those produced by a
vocoder or by
FM synthesis. In the latter case, the selected speech sound elements comprise
appropriate
instructions for the vocoder or FM synthesis system to create the desired
sounds.
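The selection and assembly steps of blocks 60 and 64 might be sketched as follows. This is a deliberately simplified stand-in: where the description contemplates statistical modeling or another suitable selection process, the sketch simply picks the stored unit whose pitch is nearest the requested pitch, and it fits durations by nearest-neighbour resampling. The Unit class, the database layout and the (phoneme, duration, pitch) tuples are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    phoneme: str     # speech element this stored sound represents
    pitch_hz: float  # nominal pitch of the stored sound
    samples: list    # digitized samples of the stored sound


def select(database, phoneme, target_pitch_hz):
    """Pick the stored unit for `phoneme` whose pitch is closest to the target."""
    candidates = [unit for unit in database if unit.phoneme == phoneme]
    if not candidates:
        raise KeyError(f"no sound element available for {phoneme!r}")
    return min(candidates, key=lambda unit: abs(unit.pitch_hz - target_pitch_hz))


def fit(samples, length):
    """Stretch or shrink samples to `length` by nearest-neighbour resampling."""
    if length <= 0 or not samples:
        return []
    return [samples[i * len(samples) // length] for i in range(length)]


def assemble(database, elements, sample_rate=8000):
    """Concatenate selected units, each fitted to its decoded duration."""
    output = []
    for phoneme, duration_ms, pitch_hz in elements:
        unit = select(database, phoneme, pitch_hz)
        output.extend(fit(unit.samples, sample_rate * duration_ms // 1000))
    return output
```

A practical generator would smooth the joins between units and adjust pitch continuously; the sketch only shows how decoded element and prosody values can drive the choice and length of each sound element.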
As will be apparent to those of skill in the art, the determination of speech
elements and prosody information and the selection of speech sound elements
for
construction of facsimiles of the original speech can be less than completely
accurate.
However, as is well known, most individuals do not always speak in a perfectly
correct
manner and yet their speech is still understood by others. This understanding
of 'flawed'
speech is possible due to the redundancy of information in speech, including
the context of
the speech and the relationship between the speech elements and the prosody.
The present
invention takes advantage of the human ability to understand even flawed
speech without
undue effort and it has been determined by the present inventor that
relatively many errors in
the overall process of constructing a facsimile of the original speech are
easily tolerated by
most humans.
Phoneme determination, and speech determination in general, are statistical in
nature and prone to errors and 'almost right' recognition. The present
invention preferably
takes advantage of the relationship in speech between phonemes and allophones
to provide
acceptable playback of speech. As mentioned above, English has between about
forty and
about sixty phonemes but there are many ways of pronouncing these phonemes,
since they
tend to be modified by dialect, preceding and following sounds, emotion,
non-word speech
components such as the pitch rise for questions, etc. In constructing a
facsimile of the
original speech, the present invention can use the speech relationship between
phonemes and
allophones to select the most probable phoneme matches by also considering the
list of
allophones. As the top matches tend to be similar, for example "f" and "v"
might be the top
recognition for the first phone in "fire", an allophone which is a close match
to both can be
found by examining the allophone list against the phoneme probabilities.
Selecting the best
phoneme match based on both the phoneme probability and the allophone
probability allows
the present invention to obtain matches very close to the original sound.
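The "f" and "v" example can be sketched as a scoring step that weighs each candidate allophone by the recognizer's phoneme probabilities. The probabilities, the allophone names and the similarity weights below are invented for illustration, and the weighted-sum score is one plausible reading of the passage rather than a method stated in this description.

```python
def best_allophone(phoneme_probs, allophone_table):
    """Return the allophone with the highest probability-weighted similarity.

    phoneme_probs: phoneme -> probability reported by the recognizer.
    allophone_table: allophone -> {phoneme: similarity in 0..1} (hypothetical).
    """
    def score(allophone):
        similarities = allophone_table[allophone]
        return sum(
            phoneme_probs.get(phoneme, 0.0) * similarity
            for phoneme, similarity in similarities.items()
        )

    return max(allophone_table, key=score)


# Invented figures: the recognizer is unsure between "f" and "v".
probs = {"f": 0.45, "v": 0.40, "s": 0.15}

# Hypothetical allophone table; names and weights are placeholders.
table = {
    "f_voiceless": {"f": 1.0},
    "fv_blend": {"f": 0.9, "v": 0.9},
    "s_plain": {"s": 1.0},
}

choice = best_allophone(probs, table)
```

With these invented numbers the sound that is close to both top candidates scores highest, which mirrors the idea of matching on the most probable sound instead of committing to a single phoneme.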
One use of the present invention, which is illustrated schematically in Figure
4, allows transmission of the encoded speech information to a remote location
via a relatively
low bandwidth data network. For example, a conversation can be conducted by
two or more
individuals over a packet network 100, such as the Internet, with each party
employing a
computer system 104a, 104b such as a general purpose IBM compatible personal
computer or
a dedicated special purpose computer-based system. Computer system 104
operates to
acquire the user's speech via a microphone 108a, 108b and to determine and
encode the
speech elements and prosody information as acquisition portion 14 of system
10. The
encoded information is then transmitted through the Internet 100 to the other
user. Further,
each user's computer system 104 receives the encoded speech information from
the other user
or users and constructs a facsimile of the original speech as the speech
generation portion 16
of system 10, as described above, which is then output to the user via a
loudspeaker 112,
headset or other suitable output device.
The relatively low bandwidth required to transmit the encoded speech
information, which can also be compressed prior to transmission and
decompressed after
reception if desired, allows the present invention to be employed to permit
players of online
games to have spoken communication with other players over the limited
bandwidth of their
connections to the Internet over which the other game-related information must
also be
transmitted.
In addition to use for bandwidth efficient communications, the present
invention can provide a number of other features, advantages and uses. For
example, in the
above-mentioned online gaming uses, the speech generation portion 16 at each
user's
computer system can employ digitized samples of 'celebrity' voices as the
speech sound
elements. It is contemplated that a user can select a Darth Vader-like voice
to be assigned to
a first user and a HAL 9000-like voice to be assigned to a second user.
Further, it is also
contemplated that the speech generation portion 16 can include a predefined
selection of
celebrity voices, and each first user can specify which of the available
voices is to be used
to reconstruct the first user's speech in the second user's speech generation
portion 16.
The present invention can also be employed to make more efficient use of
systems wherein bandwidth is expensive. For example, many corporations now
operate call
centers which, for labor cost and other reasons, are often located in
locations which are
distant from the customers who access the call center. While the final
financial analysis still
supports such distant call centers, much of the otherwise potential savings
can be absorbed by
the necessity that the corporation provide and pay for toll free access to the
call center for the
customers. While a call center would still have to fund toll free access, with
the present
invention it is possible to significantly reduce the total required bandwidth
to reduce the
corporation's costs. In particular, as shown in Figure 5, the corporation can
provide sub-centers 160 which can be accessed locally by customers 164.
Sub-centers 160, which can be personal computer systems, minicomputers,
special purpose devices or any other suitable system as will occur to those
of skill in the art, comprise the above-described speech acquisition
portion 14 and speech generation portion 16 of system 10. Sub-centers 160
acquire, determine, encode and transmit speech information from customers 164
calling the
sub-center 160 to call center 168 over a suitable communications link 166.
Communications
link 166 can be a leased data connection, a packet network, satellite link or
any other suitable
data communications link, as will occur to those of skill in the art.
The transmitted speech information is received at a suitable call center
computer system 168 at the call center. Call center computer system 168 can be
personal
computer systems, minicomputers, special purpose devices or any other suitable
system as
will occur to those of skill in the art, and also comprises the
above-described speech
acquisition portion 14 and speech generation portion 16 of system 10. When
call center
computer system 168 receives the transmitted speech information, speech
generation portion
16 constructs a facsimile of the customer's original speech. The attendant 172
at the call
center then responds to the inquiry and the attendant's speech is processed by
the speech
acquisition portion 14 of call center computer system 168 to determine, encode
and transmit,
over link 166, speech information to the appropriate sub-center 160 where a
facsimile of the attendant's speech is constructed and provided to the customer
164 by sub-center
160.
In this manner, less bandwidth is required for communications link 166
between the sub-center 160 and call center computer system 168, thus reducing
the
corporation's costs. As is also indicated in the Figure, if desired and if
sufficient processing
capacity exists at sub-center 160, multiple calls can be simultaneously
processed by sub-center 160 and speech information between sub-center 160
and call center
computer system
168 can be appropriately multiplexed over the communications link 166.
In addition to reducing bandwidth requirements or making more efficient use
of available bandwidth, the present invention also provides a process whereby
space efficient
storage of speech can be accomplished. In this specific case of a call center,
every
conversation between customers 164 and attendants 172 can be stored, in
encoded speech
information form, and requires less storage capacity than other conventional
methods.
Further, as discussed below, speech stored with the present invention can be
effectively
searched for the occurrence of words or phrases of interest.
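A minimal sketch of such a search, assuming the stored information is available as an ordered list of (symbol, duration) pairs, is given below. The small pronunciation lexicon and its transcriptions are illustrative placeholders, and the exact sequence match is a simplification: a practical search would probably tolerate recognition errors and could also use the prosody information.

```python
# Hypothetical pronunciation lexicon; the transcriptions are illustrative only.
LEXICON = {
    "based": ["b", "eɪ", "s", "t"],
    "in": ["ɪ", "n"],
}


def phrase_to_symbols(phrase):
    """Expand a phrase into its speech element symbols using the lexicon."""
    symbols = []
    for word in phrase.lower().split():
        symbols.extend(LEXICON[word])
    return symbols


def find_occurrences(elements, phrase):
    """Return the start times (ms) at which the phrase occurs in the speech.

    elements: ordered list of (symbol, duration_ms) pairs from decoding.
    """
    query = phrase_to_symbols(phrase)
    if not query:
        return []
    symbols = [symbol for symbol, _ in elements]
    starts, now = [], 0
    for _, duration_ms in elements:
        starts.append(now)
        now += duration_ms
    width = len(query)
    return [
        starts[i]
        for i in range(len(symbols) - width + 1)
        if symbols[i:i + width] == query
    ]
```

The returned start times let a playback tool jump directly to each occurrence, in contrast to the "overspeed" listening described in the background.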
As an additional advantage of the present invention, speech generation
portions 16 employed at sub-centers 160 can employ a preferred voice. For
example, the
corporation for which the call center is operated can have a celebrity
corporate spokesperson
and the celebrity's voice can be employed in the sub-center speech generation
portions 16.
An additional feature of the present invention is that it can be used to alter
or
remove cultural or regional accents and speech features of the speakers. One
of the
disadvantages with operating call centers distant from customers is that accents and
other speech
features of the call center attendants can emphasize or otherwise interfere
with the
relationship between the attendant and the customer. For example, a customer
from the US
mid-west may be uncomfortable in speaking with an attendant in a call center
in Ireland, who
has a noticeable Irish accent. In the past, this has necessitated that call
centers attempt to hire
or train attendants without strong accents or other undesired identifying
speech
characteristics.
Figure 6 shows the principal functional blocks of a speech generation process
in accordance with another embodiment of the present invention. In this
embodiment, as the
speech generation process constructs a facsimile of the original speech from
determined
speech elements and prosody, it is possible to identify and reduce, or even
eliminate, accents
and other undesired identifying speech characteristics. The speech generation
process of
Figure 6 is similar to that of Figure 3, discussed above. In block 200,
encoded speech
information is received via a communications link or is retrieved from a
storage device. In
block 204, the encoded information is decoded to obtain the speech elements
and prosody
information. In block 208, the decoded information is examined, using
statistical models,
pattern matching or other appropriate techniques, to recognize undesired
speech
characteristics and to substitute information into the speech element
information and prosody
information to reduce or remove those identified undesired characteristics to
form modified
speech information. In block 212, the modified speech information is used to
select speech
sound elements and in block 216 a constructed modified facsimile of the
original speech is
output to the user.
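The examine-and-substitute step of block 208 might be sketched as follows. This Python example assumes that decoded speech is a list of elements, each carrying a symbol plus duration and pitch as prosody, and that undesired characteristics are described by a simple lookup table and a duration cap. The SpeechElement type, the rule table and the modify function are hypothetical; the specification leaves the recognition technique open (statistical models, pattern matching or other techniques).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SpeechElement:
    symbol: str       # speech element label (hypothetical notation)
    duration_ms: int  # prosody: duration of the element
    pitch_hz: float   # prosody: pitch of the element


def modify(elements, symbol_rules, max_duration_ms):
    """Replace flagged symbols and cap over-long durations.

    symbol_rules maps an undesired symbol to its preferred substitute.
    Elements that match no rule pass through unchanged.
    """
    modified = []
    for el in elements:
        symbol = symbol_rules.get(el.symbol, el.symbol)
        duration = min(el.duration_ms, max_duration_ms)
        modified.append(replace(el, symbol=symbol, duration_ms=duration))
    return modified


# Hypothetical decoded speech and rules; the labels are not linguistically exact.
decoded = [SpeechElement("th", 80, 120.0), SpeechElement("oi", 300, 130.0)]
rules = {"oi": "ai"}
modified = modify(decoded, rules, max_duration_ms=200)
```

The modified list would then be passed on to the selection of speech sound elements (block 212). A lookup table is the simplest stand-in here; a statistical model could replace it without changing the surrounding flow.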
As will be apparent to those of skill in the art, the recognition and
substitution
process to remove undesired speech aspects can be performed in the speech
acquisition
portion 14, prior to transmission or storage of the speech information, or in
the speech
generation portion 16, as discussed above.
Another feature of the present invention is the ability to facilitate the
location
of desired phrases or words in speech. In many applications it is desired to
have the ability to
search stored speech for the occurrence of particular words or phrases. For
example, an
attorney may wish to locate a phrase in an audio recording of the testimony of
a witness,
which can be several hours in length. Presently, such a search can be
conducted only by
manually or automatically transcribing the testimony to text and then
performing
conventional text searching operations, or by listening to the testimony
recording. None of
these approaches is satisfactory in many circumstances, with manual
transcription being
expensive, automatic transcription being subject to errors and listening to
the recording being
time consuming.
With the present invention, the speech to be searched can be processed, with a
speech acquisition process similar to that of Figure 3, to determine encoded
speech
information. Speech can either be stored only in encoded speech element form,
such as for
the call center call logging discussed above, or can be stored in both an
encoded speech
element form and an analog or digitized form. In either case, the determined
speech elements
are searched to locate words or phrases of interest.


When it is desired to locate the occurrence of a word or phrase of interest,
the
word or phrase is processed by the speech acquisition process to determine
encoded speech
information for the word or phrase. The search is then performed on the speech
by
comparing the stored determined speech element and prosody information with
the speech
element and prosody information for the search word or phrase. A suitable
pattern matching
or statistical algorithm can be employed to locate high probability matches
between the
speech information and the search information. For applications where the
speech has only
been stored as encoded speech information, such as the above-mentioned call
logging for a
call center, the identified matches are provided to a speech generation
process, as described
above, to let the searcher hear a construction of the appropriate portions of
the speech. For
applications where the original speech, in analog or digitized form, has also
been stored, the
time information related to the located matches can be provided so that the
user can directly
access the appropriate times in the original speech. As will be apparent to
those of skill in the
art, the above-mentioned processes for removing accents and undesired speech
characteristics
can also be employed with this search process to augment the ability to locate
phrases or
words when the searcher and the searchee have different speech
characteristics. As will also
be apparent, the stored determined speech information can be indexed, in any
suitable manner
as will occur to those of skill in the art, to enhance the speed at which
searches of large
datasets can be performed.
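A minimal sketch of such a search, in Python, is given below. It assumes the stored speech is a list of (symbol, start time) pairs and that the search phrase has already been converted to a list of symbols by the acquisition process. The similarity measure is the ratio provided by difflib.SequenceMatcher from the Python standard library, used here as a stand-in for the "suitable pattern matching or statistical algorithm" mentioned above; prosody is ignored for brevity, and the function name find_phrase and the threshold are assumptions.

```python
from difflib import SequenceMatcher


def find_phrase(stored, query, min_ratio=0.8):
    """Locate likely occurrences of a query in stored speech elements.

    stored: list of (symbol, start_seconds) pairs for the recorded speech.
    query:  list of symbols for the search word or phrase.
    Returns a list of (start_seconds, ratio) for windows scoring >= min_ratio.
    """
    symbols = [symbol for symbol, _ in stored]
    n = len(query)
    hits = []
    for i in range(len(symbols) - n + 1):
        window = symbols[i:i + n]
        ratio = SequenceMatcher(None, window, query, autojunk=False).ratio()
        if ratio >= min_ratio:
            hits.append((stored[i][1], ratio))
    return hits


# Hypothetical stored speech elements with their start times in seconds.
stored = [("dh", 0.0), ("ax", 0.1), ("k", 0.2), ("ae", 0.3),
          ("t", 0.4), ("s", 0.5), ("ae", 0.6), ("t", 0.7)]
hits = find_phrase(stored, ["k", "ae", "t"])
```

The returned start times are what would let a user jump to the matching position in an original recording, where one has been kept. Scanning every window is linear in the length of the stored speech, so for large datasets an index over short runs of symbols would be the natural next step.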
The present invention can also be employed in voice transcription systems.
For example, a user can provide dictation which is processed to obtain encoded
speech
information, as described above, which can be stored in an efficient manner.
The encoded
speech can be used to construct a facsimile of the original speech for the
user, thus providing a
spoken memo pad-like application, or can be subsequently transcribed to text.
In the former
case, digitized samples of the voice of the user can be used to reconstruct
the speech, if
desired. In the latter case, the transcription can be performed by humans, in
a conventional
manner, or can be performed by speech to text transcription systems, such as
the above-
mentioned Dragonsoft and other systems.
One of the perceived advantages of the present invention, when used with
speech to text transcription systems, is that as the speech information is
stored, it can be
examined in a non-real time manner. This allows multiple passes to be
performed by the
speech to text transcription system. Also, as mentioned above, accents or
other speech
characteristics, which can confuse a speech to text transcription system, can
be removed or
mitigated by the present invention and this can reduce or eliminate lengthy
training of the
transcription system by the user.
As will be apparent to those of skill in the art, the present invention can
also be
employed with voice messaging type applications. For example, encoded speech
information
can be emailed to a recipient, who will then provide the speech information to
a speech
generation process, which can be a plug-in component for the email program, to
hear a
constructed facsimile of the original speech.
In another aspect, the present invention can provide voice recognition of
users.
In this embodiment, several base samples of the user's speech are processed to
obtain speech
element and prosody base information for that user and this information is
stored in a
database. To subsequently identify that user, the user is requested to speak
to provide a test
sample of their speech to the system. The test sample is processed to obtain
speech element
and prosody information from the test sample and the obtained information is
compared to
the base information previously determined for the user. A statistical
comparison is then
performed between the information for the base samples and the test sample to
determine the
degree of correspondence therebetween. Unlike prior art voice printing
systems, the present
invention is less sensitive to differences in the microphone and environment
with which the
base and the test samples are acquired and does not require that the user
repeat the particular
base samples originally provided to the system.
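The statistical comparison could take many forms. The Python sketch below shows one simple possibility, assuming each sample has been reduced to (symbol, pitch) pairs: a per-symbol mean and spread of pitch is built from the base samples, and the test sample is scored by its mean absolute deviation from that profile, measured in units of spread. The choice of pitch as the only prosody feature, the function names and the min_spread floor are assumptions made for this example and are not taken from the specification.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_profile(base_samples):
    """Build per-symbol (mean, spread) pitch statistics from base samples.

    base_samples: list of samples, each a list of (symbol, pitch_hz) pairs.
    Symbols seen fewer than twice are dropped, as no spread can be estimated.
    """
    by_symbol = defaultdict(list)
    for sample in base_samples:
        for symbol, pitch in sample:
            by_symbol[symbol].append(pitch)
    return {s: (mean(v), pstdev(v)) for s, v in by_symbol.items() if len(v) >= 2}


def correspondence(profile, test_sample, min_spread=1.0):
    """Mean absolute z-score of the test sample (lower means closer)."""
    scores = []
    for symbol, pitch in test_sample:
        if symbol in profile:
            mu, sigma = profile[symbol]
            scores.append(abs(pitch - mu) / max(sigma, min_spread))
    if not scores:
        raise ValueError("no speech elements in common with the base samples")
    return mean(scores)


# Hypothetical base samples for one user and two test samples.
base = [[("a", 100.0), ("e", 120.0)], [("a", 104.0), ("e", 124.0)]]
profile = build_profile(base)
same_user = correspondence(profile, [("a", 103.0), ("e", 121.0)])
other_user = correspondence(profile, [("e", 160.0), ("a", 140.0)])
```

Because the comparison is made element by element, the test sample can contain different words from the base samples, provided it shares enough speech elements with them. A decision threshold on the score would have to be chosen empirically.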


The above-described embodiments of the invention are intended to be
examples of the present invention and alterations and modifications may be
effected thereto,
by those of skill in the art, without departing from the scope of the
invention which is defined
solely by the claims appended hereto.

Representative Drawing
A single figure which represents the drawing illustrating the invention.
Administrative Status

Title                            Date
Forecasted Issue Date            Unavailable
(22) Filed                       1998-10-26
(41) Open to Public Inspection   2000-04-26
Dead Application                 2002-10-28

Abandonment History

Abandonment Date   Reason                                       Reinstatement Date
2001-10-26         FAILURE TO PAY APPLICATION MAINTENANCE FEE

Payment History

Fee Type                                   Anniversary Year   Due Date     Amount Paid   Paid Date
Application Fee                                                            $150.00       1998-10-26
Registration of a document - section 124                                   $100.00       1999-06-08
Maintenance Fee - Application - New Act    2                  2000-10-26   $50.00        2000-10-23
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
LANCASTER EQUITIES LIMITED
Past Owners on Record
LUTZ, MARC
VANGE, MARK
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents

Document Description     Date (yyyy-mm-dd)   Number of pages   Size of Image (KB)
Representative Drawing   2000-04-18          1                 6
Cover Page               2000-04-18          1                 32
Abstract                 1998-10-26          1                 15
Description              1998-10-26          17                792
Claims                   1998-10-26          3                 88
Drawings                 1998-10-26          4                 56
Assignment               1998-10-26          3                 121
Correspondence           1998-12-15          2                 80
Correspondence           1998-12-08          1                 26
Assignment               1998-10-26          2                 78
Assignment               1999-06-08          4                 146