Patent 2409488 Summary

Third-party information liability

Some of the information on this Web page has been provided by external sources. The Government of Canada is not responsible for the accuracy, reliability or currency of the information supplied by external sources. Users wishing to rely upon this information should consult directly with the source of the information. Content provided by external sources is not subject to official languages, privacy and accessibility requirements.

Claims and Abstract availability

Any discrepancies in the text and image of the Claims and Abstract are due to differing posting times. Text of the Claims and Abstract are posted:

  • At the time the application is open to public inspection;
  • At the time of issue of the patent (grant).
(12) Patent: (11) CA 2409488
(54) English Title: METHOD AND SYSTEM FOR REAL-TIME SPEECH RECOGNITION
(54) French Title: METHODE ET SYSTEME DE RECONNAISSANCE DE LA PAROLE EN TEMPS REEL
Status: Expired and beyond the Period of Reversal
Bibliographic Data
(51) International Patent Classification (IPC):
  • G10L 15/02 (2006.01)
  • G10L 15/12 (2006.01)
  • G10L 15/14 (2006.01)
  • G10L 15/34 (2013.01)
(72) Inventors :
  • DESTREZ, NICOLAS (Switzerland)
  • DUFAUX, ALAIN (Switzerland)
  • BRENNAN, ROBERT (Canada)
  • CORNU, ETIENNE (Canada)
  • SHEIKHZADEH-NADJAR, HAMID (Canada)
(73) Owners :
  • SEMICONDUCTOR COMPONENTS INDUSTRIES, LLC
(71) Applicants :
  • SEMICONDUCTOR COMPONENTS INDUSTRIES, LLC (United States of America)
(74) Agent: GOWLING WLG (CANADA) LLP
(74) Associate agent:
(45) Issued: 2007-05-29
(22) Filed Date: 2002-10-22
(41) Open to Public Inspection: 2003-04-22
Examination requested: 2002-10-22
Availability of licence: N/A
Dedicated to the Public: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): No

(30) Application Priority Data:
Application No. Country/Territory Date
2,359,544 (Canada) 2001-10-22

Abstracts

English Abstract

Method and system for real-time speech recognition is provided. The speech algorithm runs on a platform having an input-output processor and a plurality of processor units. The processor units operate substantially in parallel or sequentially to perform feature extraction and pattern matching. While the input-output processor creates a frame, the processor units execute the feature extraction and the pattern matching. Shared memory is provided for supporting the parallel operation.


French Abstract

Méthode et système de reconnaissance de la parole en temps réel. L'algorithme de parole défile sur une plate-forme dotée d'un processeur d'entrée-sortie et de plusieurs unités de processeur. Les unités de processeur fonctionnent essentiellement en parallèle ou de façon séquentielle pour effectuer l'extraction de la caractéristique et l'appariement de formes. Pendant que le processeur d'entrée-sortie crée une trame, les unités de processeur exécutent l'extraction de la caractéristique et l'appariement de formes. De la mémoire partagée est prévue pour permettre l'opération en parallèle.

Claims

Note: Claims are shown in the official language in which they were submitted.


What is claimed is:
1. A system for recognizing speech in real-time, the system comprising:
an input processor for receiving samples of speech and organizing the samples into a frame; and
at least two programmable processor units having functionality of feature extraction of the frame based on Oversampled Filterbank Analysis (OFBA) and pattern matching, the functionality being divided and assigned to each processor unit, the feature extraction including a bin energy factorization and an additional signal processing for the feature extraction,
the at least two processor units including:
a first processor unit for performing the OFBA and the bin energy factorization, and
one or more second processor units for performing the additional signal processing and the pattern matching,
the processor units operating sequentially or substantially in parallel, the processor unit and the input processor operating sequentially or substantially in parallel.
2. The system as claimed in claim 1, wherein the input processor and the processor units operate in parallel, and the one or more second processor units perform the additional signal processing of determining FFT band energies, energy bins and Mel Frequency Cepstrum Coefficient (MFCC).
3. The system as claimed in claim 2 further comprising a memory which
the input processor and the processor units share.

4. The system as claimed in claim 3, wherein the first processor unit performs the bin energy factorization using vector multiplication which multiplies the FFT band energies by a vector, and the second processor unit executes the vector multiplication when the FFT band energies are calculated.
5. The system as claimed in claim 4, wherein the first processor unit stores the results of the vector multiplication in a first buffer, the one or more second processor units subtracting the data in the first buffer from an original FFT band energy, and storing the subtraction result in a second buffer.
6. The system as claimed in claim 5, wherein the one or more second processor units assign the values in the first and second buffers using an index table mapping the FFT bands to the energy bins.
7. The system as claimed in claim 5, wherein the one or more second processor units calculate Discrete Cosine Transform (DCT) based on the results of the mapping.
8. The system as claimed in claim 5, wherein the first processor unit calculates Discrete Cosine Transform (DCT).
9. The system as claimed in claim 3, wherein at least one of the second processor units performs the pattern matching such that the second processor unit compares the input against pre-stored templates using a pattern matching technique.
10. The system as claimed in claim 3, wherein at least one of the second processor units performs the pattern matching using Hidden Markov Models (HMM).

11. The system as claimed in claim 3, wherein at least one of the second processor units performs the pattern matching using Dynamic Time Warping (DTW).
12. The system as claimed in claim 11, wherein the pattern matching technique includes one or more general pattern recognition techniques including artificial neural networks, Bayesian classification, and template matching using Euclidean or other distances.
13. The system as claimed in claim 3, wherein the one or more second processor units perform endpoint detection when calculating the FFT band energies.
14. The system as claimed in claim 1, wherein the input processor applies a pre-emphasis filter to each sample.
15. A method of recognizing speech in real-time, the method comprising the steps of:
receiving samples of speech and creating a frame;
extracting a feature of the frame by performing Oversampled Filterbank Analysis (OFBA); and
performing pattern matching based on the feature,
the creating step, the extracting step and the performing step being implemented substantially in parallel,
the extracting step including a step of performing the OFBA and a bin energy factorization, and a step of performing an additional signal processing for the feature extraction,
the step of performing the OFBA and a bin energy factorization and the step of performing an additional signal processing being implemented substantially in parallel.
16. A method of claim 15, wherein the extracting step includes the steps of:
performing Fast Fourier Transform (FFT); and
calculating Discrete Cosine Transform (DCT) based on FFT band energies and generating a Mel Frequency Cepstrum Coefficient (MFCC) based on the DCT,
the step of performing a bin energy factorization utilizing vector multiplication which multiplies the FFT band energies by a vector,
the step of performing FFT and the step of performing a bin energy factorization, and the step of calculating DCT and generating a MFCC being implemented substantially in parallel to the creating step for a next frame.
17. A method as claimed in claim 16, wherein the factorization step includes a step of storing the results of the vector multiplication in a first buffer and a step of subtracting the data in the first buffer from an original FFT band energy, and a step of storing the subtraction result in a second buffer.
18. A method as claimed in claim 17 wherein the factorization step includes
a step of assigning values in the first and second buffers to the energy bins
using an index table mapping the FFT bands to the energy bins.

Description

Note: Descriptions are shown in the official language in which they were submitted.


Method and System for Real-Time Speech Recognition
FIELD OF THE INVENTION:
The present invention relates to speech recognition, and more particularly
to a method and system for speech recognition substantially in real time.
BACKGROUND OF THE INVENTION:
Today, speech recognition technology relies on a standard set of
algorithms that are known to produce good results. When implemented on
computer systems, these algorithms require a certain amount of storage and
involve a relatively large number of complex calculations. Because of these
requirements, real-time speech recognition systems based on these algorithms
have so far not been successfully deployed in low-resource environments (i.e.
low power consumption, low memory usage, low computation load and
complexity, low processing delay).
An effort is ongoing to find ways to design speech recognition systems
with reduced resource requirements. For example, Deligne et al. describe a
continuous speech recognition system suitable for processors running at a
minimum of 50 MIPS and having at least 1 Mbytes of memory ("Low-Resource
Speech Recognition of 500-Word Vocabularies", Proceedings of Eurospeech
2001, pp. 1820-1832), and Y. Gong and U. H. Kao describe a system running on
a 30 MHz DSP with 64K words of memory ("Implementing a high accuracy
speaker-independent continuous speech recognizer on a fixed DSP",
Proceedings of the ICASSP 2000, pp. 3686-3689). J. Foks presents a voice
command system running on a 2.5 MHz CR16B processor ("Implementation of
Speech Recognition on CR16B CompactRisc", Proceedings of the ICSPAT
2000).
Some algorithms have been developed that require fewer resources and
are better adapted for low-resource environments. However, these algorithms
are simpler in scope and usually designed for very specific situations. In
addition, the algorithms have only allowed marginal improvements in power
consumption over the algorithms described above and are still not suitable for
ultra-low resource environments.
Another problem concerns speech recognition in noisy environments. In
these situations, special algorithms have to be applied. The algorithms
perform
voice activity detection, noise reduction or speech enhancement in order to
improve recognition accuracy. These algorithms also require complex
calculations and therefore add a lot of overhead to the system, making it even
more difficult to deploy robust speech recognition in low-resource
environments.
Therefore, it is desirable to provide a speech recognition method and
system that produces a high quality output and can be deployed in low resource
environments.
SUMMARY OF THE INVENTION:
It is an object of the present invention to provide a novel method and
system for speech recognition, which obviates or mitigates at least one of the
disadvantages of existing methods and systems.
In accordance with an aspect of the present invention, there is provided a
system for recognizing speech in real-time, which includes: an input processor
for receiving samples of speech and organizing the samples into a frame; and at
least two programmable processor units having functionality of feature extraction
of the frame based on Oversampled Filterbank Analysis (OFBA) and pattern
matching. The functionality is divided and assigned to each processor unit. The
feature extraction includes a bin energy factorization and an additional signal
processing for the feature extraction. The at least two processor units include:
a first processor unit for performing the OFBA and the bin energy factorization,
and one or more second processor units for performing the additional signal
processing and the pattern matching. The processor units operate sequentially
or substantially in parallel. The processor unit and the input processor operate
sequentially or substantially in parallel.

In accordance with a further aspect of the present invention, there is
provided a method of recognizing speech in real-time, which includes: receiving
samples of speech and creating a frame; extracting a feature of the frame by
performing Oversampled Filterbank Analysis (OFBA); and performing pattern
matching based on the feature. The creating step, the extracting step and the
performing step are implemented substantially in parallel. The extracting step
includes a step of performing the OFBA and a bin energy factorization, and a
step of performing an additional signal processing for the feature extraction. The
step of performing the OFBA and a bin energy factorization and the step of
performing an additional signal processing are implemented substantially in
parallel.
Other aspects and features of the present invention will be readily
apparent to those skilled in the art from a review of the following detailed
description of preferred embodiments in conjunction with the accompanying
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS:
The invention will be further understood from the following description with
reference to the drawings in which:
Figure 1 is a block diagram of a speech processing system in accordance
with an embodiment of the present invention;
Figure 2 is a block diagram showing one example of the platform of the
speech recognition system of Figure 1;
Figure 3 is a block diagram showing another example of the platform of
the speech recognition system of Figure 1;
Figure 4 is a flow diagram showing the operation of the DSP system of
Figure 3 for voice command algorithm operations;
Figure 5 is a schematic diagram showing how the feature extraction
process is performed on the DSP system of Figure 3;
Figure 6 is a schematic diagram showing how the triangular bands are
spread along the frequency axis;
Figure 7 is a data flow showing the energy bin calculation process; and
Figure 8 is a schematic diagram showing how the DSP core of Figure 3
assigns the values in the two buffers to the L energy bins.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT(S):
Figure 1 shows a speech processing system in accordance with an
embodiment of the present invention. The speech processing system 1000 of
Figure 1 includes a speech recognition system 100 and a host 102. The speech
recognition system 100 has the functionality of feature extraction for extracting
features from speech uttered by a user and the functionality of pattern matching
for matching the extracted features against a set of references (i.e., pre-stored
templates) in order to determine what a user said.
For feature extraction, the Mel Frequency Cepstrum Coefficients (MFCC)
algorithm may be used. For pattern matching, the Hidden Markov Models
(HMM) algorithm, the Dynamic Time Warping algorithm (DTW), or artificial
neural networks may be used.
Speech signals are sent from the host 102 to the speech recognition
system 100 for the feature extraction. The speech recognition system 100
executes the feature extraction, and then executes the speech recognition
using
the extracted features.
Based on the extracted features, training is executed off-line. The models
obtained in the training are used during real-time speech recognition in the
speech recognition system 100.
The training may be executed on the speech recognition system 100 or
the host 102 using the extracted features.
The speech processing system 1000 may operate as a voice command
system for executing commands of a user. When the speech processing
system 1000 operates in this mode, the speech recognition system 100
recognizes valid commands and one or more devices (not shown) execute
functions corresponding to the recognition result.
The speech recognition system 100 is implemented on a platform, such
as a digital signal processor (DSP), which is suitable for use in a low resource
environment (i.e. low power consumption, low memory usage, low computation
load and complexity, low processing delay).
Figure 2 is a block diagram showing one example of the platform of the
speech recognition system of Figure 1. The platform 100A of Figure 2 includes
a plurality of programmable processor units, an input-output processor (IOP) 4
and a memory 5. Figure 2 shows three programmable processor units 1 to 3.
However, the platform 100A may have more than three programmable
processor units.
The processor units 1 to 3 directly or indirectly communicate with each
other. The processor units 1 to 3 and the input-output unit 4 operate in parallel
or sequentially. The processor unit 1, 2 or 3 may be microcoded to implement a
weighted overlap-add (WOLA) filterbank as described below.
The processor units 1 to 3 perform feature extraction and pattern
recognition. The steps of performing the feature extraction and pattern
recognition are divided and assigned to the processor units 1 to 3 so that the
processing undertaken by the processor units 1 to 3 can be done in parallel. The
processor units 1, 2 or 3 may have the functionality of the training.
The input-output processor 4 manages incoming samples and outgoing
data. A front-end processor (e.g. 102 of Figure 1) includes an Analog/Digital
(A/D) converter (not shown) that samples and digitizes incoming speech
signals.
The input-output processor 4 takes the incoming digitized speech signal from
the front-end processor and applies a pre-emphasis window to the speech
signal.
The shared memory 5 is provided for the processor units 1 to 3 and the
input-output processor 4 to enhance inter-processor communication. On
performing the feature extraction and pattern matching, the processor units 1 to
3 do not need to move their calculation results to each other, but rather need
only store them in the shared memory 5. The input-output processor 4 outputs
the frame data to the memory 5 and receives the outputs of the processor units
1 to 3.
In addition to the speech input, the platform 100A may have an interface
through which to communicate results with the outside world.
Figure 3 is a block diagram showing another example of the platform of
the speech recognition system 100 of Figure 1. The speech recognition platform
(referred to as DSP system hereinafter) 100B of Figure 3 includes a
microcodeable weighted overlap-add (WOLA) filterbank 10, a 16 bit fixed point
DSP core 20, the input-output processor (IOP) 30 and Random Access Memory
(RAM) 40. The WOLA filterbank 10, the DSP core 20 and the input-output
processor 30 operate in parallel.
The input-output processor 30 is similar to the input-output processor 4 of
Figure 2. The input-output processor 30 obtains digitalized speech signals and
creates a frame. The result is output to a first-in-first-out buffer (FIFO)
45.
The WOLA filterbank 10 includes a WOLA filterbank co-processor 14,
and data memory 16. The WOLA filterbank 10 may operate as an oversampled
WOLA filterbank as described in U.S. Patent No. 6,236,731 and U.S. Patent No.
6,240,192. In the speech recognition system 100, the WOLA filterbank 10 is
microcoded (12) for performing windowing operations, the Oversampled
Filterbank Analysis (OFBA) using the Weighted Overlap-Add (WOLA) method
and the Fast Fourier Transform (FFT), and vector multiplications, which are
included in the oversampled WOLA process.
The DSP core 20 includes timers 22, an address generation module 24, a
program control unit 26, and a data Arithmetic and Logical Unit (ALU) 28. The
DSP core 20 enables the implementation of time-domain algorithms
that are not directly accomplishable by the WOLA co-processor 14, thereby
adding a degree of re-configurability.
The RAM 40 includes a data memory 41 for storing data for the DSP core
20 and the WOLA filterbank 10, and a program memory space 42 for storing the
program for the DSP core 20.
The basic concept of the DSP system 100B is disclosed in U.S. Patent
No. 6,236,731, U.S. Patent No. 6,240,192, and "A Flexible Filterbank Structure
for Extensive Signal Manipulations in Digital Hearing Aids" by R. Brennan and T.
Schneider, Proc. IEEE Int. Symp. Circuits and Systems, pp. 569-572, 1998.
The DSP system 100B communicates with the outside world through the
UART (serial port) 52, general-purpose input/output (I/O) pins 54 and an
interface (not shown) dedicated to the speech signal coming from a mixed-
signal
(analog/digital) chip (not shown).
The input/output pins 54 can be used for performing actions as a result of
recognizing commands, and for receiving additional input signals, regardless of
whether a microcontroller (60) is available or not. The microcontroller may
further process the results of the voice command output from the DSP system
100B to control one or more systems (not shown).
For example, the input/output pins 54 include one or more input pins 54A,
a visual output pin 54B and an action output pin 54C.
The input pins 54A are used to receive inputs. The input pins 54A may
be connected to switches to allow commands, such as commands for
starting/ending recognition, starting/ending feature extraction for training,
starting/ending offline training, to be sent to the DSP system 100B. The visual
output pin 54B is connected to devices, such as displays, LEDs, which provide
visual output to a user. For example, the visual output pin 54B is used to inform
the user of the current state of the system (e.g. feature extraction for off-line
training mode, recognition mode).
The action output pin 54C can be connected to various output devices.
For example, when the DSP system 100B recognizes a word, it activates one or
a combination of these pins to drive one or more external devices, such as a
speech synthesizer or a lamp.
The DSP system 100B can be applied to a wide variety of speech
processing systems.
In the following description, the DSP system 100B is exemplified as the
platform on which the speech recognition algorithm of the speech recognition
system 100 of Figure 1 runs.
Figure 4 is a flow diagram showing the operation of the DSP system 100B
of Figure 3 for voice command algorithm operations. Referring to Figures 3 and
4, in step S2, a framing operation is applied to speech signals by the input-output
processor 30. The input-output processor 30 creates frames of input speech
signals for subsequent processing by the WOLA filterbank 10 and associated
DSP core 20.
In step S4, a window, such as a Hanning window, is applied to the frame
by the WOLA filterbank 10 (i.e. the WOLA co-processor 14), so that the
distortions caused by transitions between frames are minimized, or at least
made acceptable for use in subsequent processing. After the windowing
operation, the WOLA filterbank 10 performs the OFBA including a final FFT.
In step S6, the sum of the energy bins is computed, and the logarithm
(log) of the sum is taken by the DSP core 20. The Discrete Cosine Transform
(DCT) is calculated using the log of the sum in the DSP core 20 to generate
MFCC coefficients.
In step S8, the bin energy factorization operation (vector multiplication) is
executed by the WOLA co-processor 14.
In step S10, the total frame energy and energies of FFT bands are
computed by the DSP core 20. The endpoint detection algorithm is also
executed by the DSP core 20 to detect any non-speech frames by using one or
more features, such as MFCCs, FFT band energies and the frame total energy.
The words/speech frames are used for training/recognition (in step S12 and
S14).
Steps S2 to S10 are parts of the feature extraction and endpoint detection
processes. The data produced by these processes are stored in a circular buffer
(not shown), from which they are retrieved during both the training and the
recognition phases.
In step S12, training is carried out offline. HMM or DTW may be used for
the training. In step S14, speech recognition operation is done by the DSP core
20. The DSP core 20 may employ the Viterbi algorithm or DTW for the speech
recognition as explained below.
The steps S2, S4 (or S8) and S6 (or S10) are executed on the DSP system 100B
of Figure 3 in parallel. The microcodeable WOLA filterbank 10 calculates FFT
coefficients in parallel with the rest of the system and calculates the band
energies during the process of MFCC calculation in parallel with the rest of
the
system. The DSP core 20 performs all other operations needed for speech
recognition in parallel with the operation of other components.
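To illustrate the frame-level parallelism described above, the following sketch mimics the three hardware stages (input-output processor, WOLA filterbank, DSP core) with three Python threads passing frames through shared queues. It is an illustration only, not the patent's firmware; the stage bodies (random frames, a Hanning window and FFT, band energies) are stand-ins.

    import queue
    import threading

    import numpy as np

    frames = queue.Queue(maxsize=2)   # IOP -> filterbank (shared memory in the patent)
    spectra = queue.Queue(maxsize=2)  # filterbank -> DSP core

    def iop_stage(num_frames, frame_len=256):
        # Stand-in for step S2: deliver one frame at a time while the later
        # stages are still working on older frames.
        rng = np.random.default_rng(0)
        for _ in range(num_frames):
            frames.put(rng.standard_normal(frame_len))
        frames.put(None)

    def filterbank_stage():
        # Stand-in for step S4: window the frame and take an FFT.
        while (frame := frames.get()) is not None:
            spectra.put(np.fft.rfft(frame * np.hanning(len(frame))))
        spectra.put(None)

    def core_stage(results):
        # Stand-in for steps S6-S10: band energies (feature extraction and
        # pattern matching would follow here).
        while (spec := spectra.get()) is not None:
            results.append(np.abs(spec) ** 2)

    results = []
    workers = [threading.Thread(target=iop_stage, args=(10,)),
               threading.Thread(target=filterbank_stage),
               threading.Thread(target=core_stage, args=(results,))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(len(results), "frames processed")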
The features most commonly used today in speech recognition systems
are the MFCC and their first and second order differences. The number of
coefficients and differences required and used varies depending on the
implementation. The storage requirements for each word in the recognition
vocabulary and the processing requirements are directly linked with the number
of coefficients. The numbers of coefficients and derivatives vary and are
optimized based on the desired vocabulary size, response time and expected
quality of the recognition.
Figure 5 is a schematic diagram showing how the feature extraction is
performed on the DSP system of Figure 3. In Figure 5, the endpoint detection is
also shown. The three columns describe the tasks performed by the
three processors (i.e. the WOLA filterbank 10, the DSP core 20 and the input-
output processor 30 of Figure 3) running in parallel.
Referring to Figures 3 to 5, the blocks 310, 312, 314, 316 and 318
indicate the operations performed sequentially on a single frame (e.g. a frame
comprising 256 samples). The blocks 304 to 308 indicate the operations
performed on a previous frame (not shown). The blocks 320 to 326 indicate the
operations performed on a next frame (F2).
For example, the input-output processor 30 takes as input the speech
signal sampled by the 14-bit Analog/Digital (A/D) converter (not shown) on the
mixed-signal chip at a frequency of 8 kHz. The input-output processor 30
creates frames (F1, F2 and F3), each of which includes 256 samples,
representing 32 milliseconds of speech (310, 320, 330, step S2). The framing
operation is repeated on the input-output processor 30. The frames overlap for
128 samples (16 milliseconds). The pre-emphasis filter is applied to each
signal.
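As an illustration of this framing scheme, the sketch below builds 256-sample frames with a 128-sample overlap from an 8 kHz signal and applies a first-order pre-emphasis filter. The pre-emphasis coefficient 0.97 is a common choice and is an assumption here; the patent does not specify it.

    import numpy as np

    def frame_signal(x, frame_len=256, hop=128, alpha=0.97):
        # Pre-emphasis: y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value).
        y = np.append(x[0], x[1:] - alpha * x[:-1])
        n_frames = 1 + (len(y) - frame_len) // hop
        # Overlapping frames: 256 samples (32 ms at 8 kHz), 128-sample (16 ms) hop.
        return np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])

    fs = 8000
    speech = np.random.default_rng(1).standard_normal(fs)  # 1 s stand-in signal
    frames = frame_signal(speech)
    print(frames.shape)  # (number of frames, 256)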
The MFCC calculation is launched when the input-output processor 30
indicates that a new 256-sample frame (F1) is available for processing. This
triggers a window process and an OFBA ending in a 256-point FFT, on the
WOLA filterbank 10 (i.e. WOLA co-processor 14) (312, step S4). The
oversampled filterbank is flexible and the FFT length can be chosen to be less
than 256 points if desired to minimize the computations.
When the 256-point FFT is completed, the DSP core 20 assigns the
resulting values to the L energy bins using a constant index table to map the
resulting 129 FFT bands to the L energy bins. In case other FFT lengths are
employed on the WOLA, the index table is changed accordingly. The next step
in the MFCC calculation consists of determining the logarithm of the energy of L
energy bins, which are depicted in Figure 6 as triangular bands spread non-
linearly along the frequency axis. The DSP core 20 calculates the logarithm of
the energy of L energy bins. Then, the Discrete Cosine Transform (DCT) of the
L log energy bins is calculated to generate MFCC coefficients. The DCT
operation is implemented as the multiplication of the L log energy bins by a
constant matrix, whose dimensions are L by the desired number of MFCC
coefficients (314, step S6).
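A minimal sketch of this step is given below: the log of the L energy bins is taken and multiplied by a constant L-by-C DCT matrix to produce C MFCC coefficients. The values L = 23 and C = 13 are illustrative assumptions; the patent leaves both open.

    import numpy as np

    L, C = 23, 13  # number of energy bins and MFCC coefficients (assumed values)

    # Constant DCT-II matrix, matching "multiplication of the L log energy bins
    # by a constant matrix" whose dimensions are L by the number of coefficients.
    n = np.arange(L)
    dct_matrix = np.cos(np.pi * np.outer(np.arange(C), n + 0.5) / L).T  # shape (L, C)

    def mfcc_from_bins(bin_energies):
        # Base-2 log of the bin energies (a small floor avoids log of zero),
        # followed by the constant-matrix multiplication.
        log_bins = np.log2(np.maximum(bin_energies, 1e-10))
        return log_bins @ dct_matrix  # C MFCC coefficients

    print(mfcc_from_bins(np.random.default_rng(2).random(L)).shape)  # (13,)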
When the DCT operation is completed, the DSP core 20 launches the
vector multiply function of the WOLA co-processor 14 to calculate the total frame
energy and energies of FFT bands, which multiplies the 129 FFT band energies
by a vector of constants stored in the RAM 40 (316, step S8).
When the vector multiplication is complete, the DSP core 20 (i.e. ALU 28)
determines the absolute value of each one of the 129 FFT bands as well as the
total frame energy (318, step S10).
Figures 4 and 5 show that the DCT is calculated by the DSP core 20.
However, the WOLA filterbank 10 may be used to calculate the DCT.
The endpoint is extracted in step S10 of Figure 4 and in blocks 308 and
318 of Figure 5. However, the endpoint detection may alternatively be executed
in step S6 of Figure 4 and blocks 304, 314 and 324 of Figure 5, for example,
for
energy-based end point detection.
Figure 6 is a schematic diagram showing how the triangular Mel Frequency
bands are spread along the frequency axis. In Figure 6, f0, f1, f2 represent the
first three center frequencies of the energy bins. Assume that Ef_i is the energy in
FFT band i, and that when applying the filter for bin j, it is multiplied by the
constant k_i. When applying the filter for bin j+1 to the FFT band, the multiplying
constant becomes 1 - k_i. In consequence of this property, only half the
multiplications are required when applying the filters to the FFT bands, the other
values being calculated by a relatively simple subtraction as described below.
Each FFT band energy is multiplied only by a single constant.
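The sketch below is an illustrative reconstruction of this property: for triangular bins whose adjacent weights sum to one, a single constant k per FFT band and an index table identifying the lower bin are enough. The mel-style spacing and the table layout are assumptions for illustration, not values from the patent.

    import numpy as np

    def bin_constants(n_bands=129, n_bins=20, fs=8000):
        # Mel-style center frequencies f0, f1, f2, ... mapped to FFT band indices
        # (the spacing here is only illustrative).
        mel = lambda f: 2595 * np.log10(1 + f / 700)
        inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
        centers = inv_mel(np.linspace(mel(0), mel(fs / 2), n_bins + 2))
        edges = np.round(centers / (fs / 2) * (n_bands - 1)).astype(int)

        k = np.zeros(n_bands)             # the single constant per FFT band
        lower_bin = np.full(n_bands, -1)  # index table: bin j that receives k * E
        for j in range(n_bins):
            left, right = edges[j], edges[j + 1]
            for i in range(left, right):
                k[i] = (right - i) / (right - left)  # falling edge of bin j ...
                lower_bin[i] = j                     # ... bin j+1 gets (1 - k) * E
        return k, lower_bin

    k, lower_bin = bin_constants()
    print(k[:8])
    print(lower_bin[:8])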
Figure 7 is a data flow diagram showing the energy bin calculation
process. Referring to Figures 3 and 7, the FFT band energies (a) 502 are first
multiplied by the vector of constants (multiplied by bin coefficient (k):504)
using
the WOLA co-processor 14. The resulting values, stored in the buffer a(k) are
then subtracted from the original band energies in the DSP core 20 and stored
in
a separate buffer a(1-k). These calculations are repeated until all FFT band
energies are processed.
Figure 8 is a schematic diagram showing the operation of the DSP core
20 of Figure 3 after completing the calculation of the energy bin. When the
energy bin calculation is complete, the DSP core 20 assigns the values in the
two buffers a(k) and a(1-k) to the L energy bins using two constant index
tables
602 and 604 for mapping the FFT bands to the energy bins.
The index table contains the number of the bin to which the corresponding
buffer entry contributes. After the buffers a(k) and a(1-k) have been
assigned,
the log of the L energy bins is taken using a base-2 log function. One
embodiment uses a 32-point look-up table, executes in 9 cycles, and has a
3% accuracy.
Once calculated, the MFCC coefficients are stored in a circular buffer
where they can be retrieved for training or recognition.
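The following sketch mirrors the data flow of Figures 7 and 8: the band energies are multiplied once by the constant vector k (buffer a(k)), the complementary products a(1-k) are obtained by subtraction, and both buffers are scattered into the L energy bins through index tables before the base-2 log is taken. The index table values and the use of numpy's log2 in place of a 32-point look-up table are stand-ins.

    import numpy as np

    def energy_bins(E, k, lower_bin, n_bins):
        a_k = E * k          # buffer a(k): vector multiply (WOLA co-processor, step S8)
        a_1mk = E - a_k      # buffer a(1-k): subtraction in the DSP core
        bins = np.zeros(n_bins)
        for i, j in enumerate(lower_bin):
            if 0 <= j < n_bins:
                bins[j] += a_k[i]        # bin j receives k * E[i]
            if 0 <= j + 1 < n_bins:
                bins[j + 1] += a_1mk[i]  # bin j+1 receives (1 - k) * E[i]
        # Base-2 log of each bin; the patent's embodiment uses a 32-point
        # look-up table here, numpy's log2 is used as a stand-in.
        return np.log2(np.maximum(bins, 1e-10))

    rng = np.random.default_rng(3)
    E = rng.random(129)                              # 129 FFT band energies
    k = rng.random(129)                              # toy per-band constants
    lower_bin = np.minimum(np.arange(129) // 7, 19)  # toy index table, L = 20 bins
    print(energy_bins(E, k, lower_bin, 20)[:5])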
Referring to Figure 3, the endpoint detection is described in further detail.
For real-time operation and to meet the limited memory resources available,
the
endpoint detection is implemented using the DSP core 20 and related
components and features. As described above, the DSP core 20 detects any
non-speech frames by using one or more features, such as MFCCs, FFT
band energies and the frame total energy. For example, the DSP core 20
compares the frame total energy with a threshold. The threshold is regularly
updated as a function of a noise floor that is calculated during silence frames by
the DSP core 20.
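A minimal sketch of such energy-based endpoint detection follows. The exponential smoothing of the noise floor and the margin factor are assumptions; the patent only states that the threshold is updated from a noise floor estimated during silence frames.

    import numpy as np

    def detect_speech(frame_energies, margin=4.0, smooth=0.95):
        noise_floor = frame_energies[0]
        flags = []
        for e in frame_energies:
            is_speech = e > margin * noise_floor
            if not is_speech:
                # Update the noise floor only during silence frames.
                noise_floor = smooth * noise_floor + (1 - smooth) * e
            flags.append(is_speech)
        return flags

    rng = np.random.default_rng(4)
    energies = np.concatenate([rng.random(20) * 0.1,   # silence
                               2.0 + rng.random(30),   # speech burst
                               rng.random(20) * 0.1])  # silence
    print(sum(detect_speech(energies)), "frames flagged as speech")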
The pattern recognition is now described in further detail. The speech
recognition system employs two techniques: HMM and DTW.
In the case of HMM, the Viterbi algorithm is employed to find the
likelihood of Gaussian mixture HMMs. Generally, all model parameters, MFCC
coefficients and temporary likelihood values maintained during the execution of
the Viterbi algorithm are represented as 16-bit fixed-point values. When
numbers are represented in fixed-point format, information may be lost during
multiplications because the result is truncated to 16 bits. However, the DSP
system 100B of Figure 3 features a rounding instruction (i.e. rounding the results
of calculation) that reduces these quantization errors.
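For illustration, the following is a floating-point, log-domain sketch of the Viterbi recursion used to score an utterance against an HMM. It is not the patent's 16-bit fixed-point implementation, and the emission log-likelihoods (which would come from Gaussian mixtures over MFCC vectors) are assumed to be precomputed.

    import numpy as np

    def viterbi_score(log_A, log_pi, log_B):
        """log_A: (S, S) transition log-probs; log_pi: (S,) initial log-probs;
        log_B: (T, S) per-frame emission log-likelihoods. Returns the best-path score."""
        delta = log_pi + log_B[0]
        for t in range(1, len(log_B)):
            # delta_t(j) = max_i [ delta_{t-1}(i) + log A(i, j) ] + log B_t(j)
            delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
        return np.max(delta)

    S, T = 3, 10
    rng = np.random.default_rng(5)
    log_A = np.log(np.full((S, S), 1.0 / S))    # uniform toy transition matrix
    log_pi = np.log(np.full(S, 1.0 / S))        # uniform initial distribution
    log_B = np.log(rng.random((T, S)))          # stand-in emission likelihoods
    print(viterbi_score(log_A, log_pi, log_B))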
The Dynamic Time Warping (DTW) algorithm of the pattern recognition is
described. The DTW pattern-matching module of the DSP system 100B is used
to calculate the distance between a word just spoken by a user and each
individual reference word stored in memory. The DTW finds local distortions of a
test utterance in the time-domain. According to the DTW, the test utterance can
be aligned with a template by warping its time-domain axis.
A simple form of the DTW algorithm is used in the DSP system 100B of
Figure 3. Assume that the test utterance is composed of N feature vectors and
that a reference word is composed of M feature vectors. As described in
"Implementing a high accuracy speaker-independent continuous speech
recognizer on a fixed DSP", by Y. Gong and U. H. Kao, Proceedings of the
ICASSP 2000, pp. 3686-3689, the basic DTW algorithm consists of constructing
an N by M matrix, D, where D[m,n] is calculated as follows.
if m=1 and n=1:   D[m,n] = d                        (1)
if m>1 and n=1:   D[m,n] = D[m-1,1] + d             (2)
if m=1 and n>1:   D[m,n] = D[1,n-1] + d             (3)
if m>1 and n>1:   D[m,n] = minimum of (5) to (7)    (4)
                  D[m,n-1] + d                      (5)
                  D[m-1,n] + d                      (6)
                  D[m-1,n-1] + 2*d                  (7)
In the equations (1) to (7), d represents the distance between a reference
word frame m and a test frame n.
The Euclidean distance is used for these operations.
When D has been calculated, the distance between the test utterance N and a
reference word M is defined as D[M,N] divided by N+M. In the speech
recognition system implemented on the DSP system 100B, the N by M matrix is
reduced to a 2 by M matrix in which the first column represents the previous
values, i.e. the distances at test frame n-1, and the second column represents
the test frame for which the distances are currently calculated.
When the second column is filled, its values are simply copied to the first
column and the distances for test frame n+1 are calculated and inserted in the
second column.
The initial values (in the first column) are calculated as per equations (1)
and (2) above. For each frame in the test utterance, the first element of the
second column is calculated as per equation (3) and the other values of the
second column as per equation (4). When the end of the test utterance is
reached, the top-right element of the matrix is divided by N+M to obtain the
distance.
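The sketch below follows equations (1) to (7) and the two-column reduction described above: 'prev' holds the distances for test frame n-1 and 'curr' those for the current test frame, and the final distance is D[M,N] divided by N+M. The Euclidean frame distance matches the text; the feature dimensions are arbitrary.

    import numpy as np

    def dtw_distance(test, reference):
        """test: (N, d) feature vectors; reference: (M, d) feature vectors."""
        N, M = len(test), len(reference)
        dist = lambda m, n: np.linalg.norm(reference[m] - test[n])  # Euclidean distance d

        prev = np.empty(M)           # distances at test frame n-1 (first column)
        prev[0] = dist(0, 0)                         # equation (1)
        for m in range(1, M):
            prev[m] = prev[m - 1] + dist(m, 0)       # equation (2)

        curr = np.empty(M)
        for n in range(1, N):
            curr[0] = prev[0] + dist(0, n)           # equation (3)
            for m in range(1, M):
                d = dist(m, n)
                curr[m] = min(curr[m - 1] + d,       # equation (5)
                              prev[m] + d,           # equation (6)
                              prev[m - 1] + 2 * d)   # equation (7)
            prev = curr.copy()       # copy the second column into the first
        return prev[M - 1] / (N + M)                 # D[M,N] divided by N+M

    rng = np.random.default_rng(6)
    print(dtw_distance(rng.random((40, 13)), rng.random((35, 13))))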
When the host 102 of Figure 1 executes the training using the DTW, the
host 102 includes the DTW pattern-matching module described above.
Noise reduction is now described in further detail. The DSP system 100B
also may execute the noise reduction algorithm for reducing noise components.
The oversampled WOLA allows a wide range of gain and phase adjustments in
the frequency domain. For example, the noise reduction techniques, such as
spectral subtraction, beam-forming, and subband adaptive filters, using the
oversampled WOLA process are described in "Highly Oversampled Subband
Adaptive Filters For Noise Cancellation On A Low-Resource DSP System" by
King Tam, Hamid Sheikhzadeh, Todd Schneider, Proceedings of ICSLP 2002, pp.
1793-1796, and in "A Subband Beamformer On An Ultra Low-Power Miniature
DSP Platform" by Edward Chau, Hamid Sheikhzadeh, Robert Brennan, Todd
Schneider, Proceedings of ICASSP 2002, pp. III-2953 to III-2956.
The noise reduction techniques, such as
spectral subtraction, beam-forming and subband adaptive filters, may be
integrated with the feature extraction on the DSP system 100B, which is
specifically designed for the oversampled WOLA process.
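As a generic illustration of one of the techniques named above, the sketch below applies magnitude-domain spectral subtraction to a single frame spectrum. It is not taken from the cited papers or from the patent; the noise estimate and the spectral floor are assumptions.

    import numpy as np

    def spectral_subtraction(frame_spectrum, noise_mag, floor=0.05):
        mag = np.abs(frame_spectrum)
        phase = np.angle(frame_spectrum)
        cleaned = np.maximum(mag - noise_mag, floor * mag)  # subtract, keep a floor
        return cleaned * np.exp(1j * phase)                 # recombine with the phase

    rng = np.random.default_rng(7)
    noisy = np.fft.rfft(rng.standard_normal(256))           # one noisy frame spectrum
    noise_estimate = np.full(len(noisy), 2.0)               # stand-in noise magnitude
    print(np.abs(spectral_subtraction(noisy, noise_estimate))[:5])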
According to the speech processing system 1000 of Figure 1, the speech
recognition system 100 of Figure 1 is implemented on the platform (100A, 100B)
having three processor units, which is designed for speech processing. The
speech recognition algorithm is efficiently mapped to the three processing
units
to achieve low resource utilisation.
The speech recognition algorithm and system are deployed on the
hardware platform (100B) specifically designed for speech processing.
Different
sections of the algorithm are allocated to specific components of the hardware
platform and are mapped in a manner that produces a speech recognition
system that uses much less power and is much smaller than current systems.
The type of hardware used and the way the speech recognition
algorithms are mapped to the hardware components make the system as a
whole customizable in an efficient way, particularly in terms of integrating
noise
reduction. Noise reduction algorithms can be added easily and be adapted to
the situation. This results in robustness and ultimately produces better
recognition accuracy in noisy environments while maintaining low resource
usage.
The use of the specific hardware platform (100B) and the way that the
methods are implemented on this platform provides the following advantages: 1)
Capabilities and accuracy associated with the state-of-the-art methods.
Namely,
it is scalable as far as vocabulary size is concerned, provides very good
accuracy, and can be deployed in a number of applications such as voice
command, speaker identification, and continuous speech recognition; 2) Uses
very little power and occupies very little space; and 3) Suitable for the
integration
of state-of-the-art noise reduction algorithms.
The speech recognition system (100) is particularly useful in
environments where power consumption must be reduced to a minimum or
where an embedded processor does not have the capabilities to do speech
recognition. For example, it can be used in a personal digital assistant (PDA)
to
off-load the main processor in an efficient manner. The system, which may be
implemented in one or more highly integrated chips, can also be used in
conjunction with a micro-controller in embedded systems or in a standalone
environment.

The speech recognition algorithm is applicable to a wide range of
languages. Scalability in terms of vocabulary and applications is achieved.
The speech recognition algorithm can be applied to a variety of
applications using audio input, such as a voice command system in which an
incoming voice signal (i.e. command) is analyzed and results in some operation,
speaker identification, speech recognition, or continuous speech recognition.
In the above description, various parameters used in the system, such as
sample rate, sample accuracy, window size, frame size, are used as examples.
Other parameters and accuracies may be substituted or chosen, depending on
the system requirements, and other factors.
While the present invention has been described with reference to specific
embodiments, the description is illustrative of the invention and is not to be
construed as limiting the invention. Various modifications may occur to those
skilled in the art without departing from the scope of the invention as defined by
the claims.

Representative Drawing
A single figure which represents the drawing illustrating the invention.
Administrative Status

2024-08-01: As part of the Next Generation Patents (NGP) transition, the Canadian Patents Database (CPD) now contains a more detailed Event History, which replicates the Event Log of our new back-office solution.

Please note that "Inactive:" events refer to events no longer in use in our new back-office solution.

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer, as well as the definitions for Patent, Event History, Maintenance Fee and Payment History, should be consulted.

Event History

Description Date
Time Limit for Reversal Expired 2020-10-22
Common Representative Appointed 2019-10-30
Common Representative Appointed 2019-10-30
Letter Sent 2019-10-22
Change of Address or Method of Correspondence Request Received 2018-06-11
Letter Sent 2013-10-16
Inactive: Multiple transfers 2013-10-07
Inactive: IPC deactivated 2013-01-19
Inactive: IPC from PCS 2013-01-05
Inactive: IPC expired 2013-01-01
Inactive: IPC assigned 2012-12-13
Inactive: IPC assigned 2012-12-13
Letter Sent 2012-11-15
Letter Sent 2012-11-15
Letter Sent 2012-11-15
Inactive: Multiple transfers 2012-10-10
Letter Sent 2008-11-24
Inactive: Office letter 2008-10-29
Letter Sent 2008-02-25
Inactive: Office letter 2007-10-22
Grant by Issuance 2007-05-29
Inactive: Cover page published 2007-05-28
Inactive: Final fee received 2007-03-20
Pre-grant 2007-03-20
Notice of Allowance is Issued 2006-09-28
Letter Sent 2006-09-28
Notice of Allowance is Issued 2006-09-28
Inactive: Approved for allowance (AFA) 2006-05-03
Amendment Received - Voluntary Amendment 2006-04-28
Inactive: IPC from MCD 2006-03-12
Amendment Received - Voluntary Amendment 2006-01-09
Inactive: S.30(2) Rules - Examiner requisition 2005-07-12
Inactive: S.29 Rules - Examiner requisition 2005-07-12
Letter Sent 2005-04-01
Amendment Received - Voluntary Amendment 2004-12-22
Inactive: S.30(2) Rules - Examiner requisition 2004-07-08
Inactive: S.29 Rules - Examiner requisition 2004-07-08
Letter Sent 2003-07-17
Inactive: Single transfer 2003-06-10
Application Published (Open to Public Inspection) 2003-04-22
Inactive: Cover page published 2003-04-21
Inactive: First IPC assigned 2003-01-30
Inactive: Courtesy letter - Evidence 2002-12-17
Inactive: Filing certificate - RFE (English) 2002-12-11
Letter Sent 2002-12-11
Application Received - Regular National 2002-12-11
All Requirements for Examination Determined Compliant 2002-10-22
Request for Examination Requirements Determined Compliant 2002-10-22

Abandonment History

There is no abandonment history.

Maintenance Fee

The last payment was received on 2006-09-07

Note: If the full payment has not been received on or before the date indicated, a further fee may be required which may be one of the following:

  • the reinstatement fee;
  • the late payment fee; or
  • additional fee to reverse deemed expiry.

Patent fees are adjusted on the 1st of January every year. The amounts above are the current amounts if received by December 31 of the current year.
Please refer to the CIPO Patent Fees web page to see all current fee amounts.

Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
SEMICONDUCTOR COMPONENTS INDUSTRIES, LLC
Past Owners on Record
ALAIN DUFAUX
ETIENNE CORNU
HAMID SHEIKHZADEH-NADJAR
NICOLAS DESTREZ
ROBERT BRENNAN
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents



List of published and non-published patent-specific documents on the CPD.

If you have any difficulty accessing content, you can call the Client Service Centre at 1-866-997-1936 or send them an e-mail at CIPO Client Service Centre.


Document Description   Date (yyyy-mm-dd)   Number of pages   Size of Image (KB)
Description 2002-10-21 15 729
Abstract 2002-10-21 1 14
Drawings 2002-10-21 8 116
Claims 2002-10-21 4 120
Representative drawing 2003-01-29 1 11
Description 2004-12-21 15 726
Claims 2004-12-21 4 124
Claims 2006-01-08 4 140
Drawings 2006-01-08 8 113
Description 2006-01-08 16 767
Representative drawing 2007-05-10 1 12
Acknowledgement of Request for Examination 2002-12-10 1 174
Filing Certificate (English) 2002-12-10 1 159
Courtesy - Certificate of registration (related document(s)) 2003-07-16 1 105
Reminder of maintenance fee due 2004-06-22 1 111
Commissioner's Notice - Application Found Allowable 2006-09-27 1 161
Courtesy - Certificate of registration (related document(s)) 2012-11-14 1 103
Maintenance Fee Notice 2019-12-02 1 168
Correspondence 2002-12-10 1 24
Fees 2004-10-07 1 31
Fees 2005-09-20 1 33
Fees 2006-09-06 1 39
Correspondence 2007-03-19 2 49
Correspondence 2007-10-21 1 18
Fees 2007-10-04 1 31
Correspondence 2008-02-24 1 14
Correspondence 2008-01-22 1 32
Fees 2007-10-04 1 31
Correspondence 2008-10-28 1 16
Correspondence 2008-11-23 1 12
Fees 2008-10-07 1 32
Correspondence 2008-11-06 1 34
Fees 2008-10-07 1 36