Patent 3154029 Summary

Third-party information liability

Some of the information on this Web page has been provided by external sources. The Government of Canada is not responsible for the accuracy, reliability or currency of the information supplied by external sources. Users wishing to rely upon this information should consult directly with the source of the information. Content provided by external sources is not subject to official languages, privacy and accessibility requirements.

Claims and Abstract availability

Any discrepancies in the text and image of the Claims and Abstract are due to differing posting times. Text of the Claims and Abstract are posted:

  • At the time the application is open to public inspection;
  • At the time of issue of the patent (grant).
(12) Patent Application: (11) CA 3154029
(54) English Title: DEEP LEARNING-BASED EMOTIONAL SPEECH SYNTHESIS METHOD AND DEVICE
(54) French Title: PROCEDE ET DISPOSITIF DE SYNTHESE DE PAROLE EMOTIONNELLE FONDES SUR UN APPRENTISSAGE PROFOND
Status: Examination Requested
Bibliographic Data
(51) International Patent Classification (IPC):
  • G10L 13/02 (2013.01)
(72) Inventors :
  • ZHONG, YUQI (China)
(73) Owners :
  • 10353744 CANADA LTD. (Canada)
(71) Applicants :
  • 10353744 CANADA LTD. (Canada)
(74) Agent: HINTON, JAMES W.
(74) Associate agent:
(45) Issued:
(86) PCT Filing Date: 2020-06-19
(87) Open to Public Inspection: 2021-03-18
Examination requested: 2022-09-16
Availability of licence: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): Yes
(86) PCT Filing Number: PCT/CN2020/096998
(87) International Publication Number: WO2021/047233
(85) National Entry: 2022-03-10

(30) Application Priority Data:
Application No. Country/Territory Date
201910850474.8 China 2019-09-10

Abstracts

English Abstract

A deep learning-based emotional speech synthesis method and device, the method at least comprising the following steps: extracting text information to be processed and preamble information of the text information to be processed (S1); by using the text information to be processed and the preamble information as an input, using a pre-constructed first model to generate emotion characteristic information (S2); and by using the emotion characteristic information and the text information to be processed as an input, using a pre-trained second model to synthesize emotional speech (S3). In the described method, on the basis of only acquiring text information, emotional speech may be synthesized on the basis of deep learning without needing to manually label the emotion of each acoustic pronunciation in advance. Moreover, labeling errors may further be reduced while labor costs are reduced, the suitability of emotion information is improved, conversational speech emotions are diversified, the naturalness and smoothness of synthesized speech are improved, the experience of human-machine communication is improved, and the adaptability is wide.


French Abstract

La présente invention concerne un procédé et un dispositif de synthèse de parole émotionnelle fondés sur un apprentissage profond, le procédé comprenant au moins les étapes suivantes consistant : à extraire des informations textuelles à traiter et des informations de préambule des informations textuelles à traiter (S1) ; par utilisation des informations textuelles à traiter et des informations de préambule en tant qu'entrée, utiliser un premier modèle préconstruit pour générer des informations caractéristiques d'émotion (S2) ; et par utilisation des informations caractéristiques d'émotion et des informations textuelles à traiter en tant qu'entrée, utiliser un second modèle préentraîné pour synthétiser une parole émotionnelle (S3). Dans le procédé décrit, sur la base uniquement d'une acquisition d'informations textuelles, une parole émotionnelle peut être synthétisée sur la base d'un apprentissage profond sans avoir besoin de marquer manuellement l'émotion de chaque prononciation acoustique à l'avance. De plus, les erreurs de marquage peuvent en outre être réduites tandis que les coûts de main-d'œuvre sont réduits, le caractère approprié des informations d'émotion est amélioré, les émotions de la parole conversationnelle sont diversifiées, le caractère naturel et le caractère lisse de la parole synthétisée sont améliorés, l'expérience de la communication personne-machine est améliorée, et l'adaptabilité est vaste.

Claims

Note: Claims are shown in the official language in which they were submitted.


CLAIMS
1. A deep learning-based emotional speech synthesis method, at least comprising:
extracting text information to be processed and preamble information of the process-pending text information, wherein the preamble information includes the preamble text information;
using a pre-constructed first model to generate emotion feature information by assigning the process-pending text information and the preamble information as an input; and
using a pre-trained second model to synthesize emotional speech by assigning the emotion feature information and the process-pending text information as an input.
2. The deep learning-based emotional speech synthesis method of claim 1, characterized in that the described first model is composed of a first sub model, a second sub model, and a third sub model, wherein the described use of a pre-constructed first model to generate emotion feature information by assigning the described process-pending text information and the preamble information as an input includes:
assigning the described process-pending text information and the preamble information as an input, then extracting features by the pre-trained first sub model to obtain a first intermediate output;
assigning the described process-pending text information and the first intermediate output as an input, then extracting features by the pre-trained second sub model to obtain an emotion type and a second intermediate output; and
assigning the described second intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extracting features by the pre-trained third sub model to obtain the emotion feature information.
3. The deep learning-based emotional speech synthesis method of claim 1, characterized in that, when the described preamble information further includes preamble speech information, the described first model further includes a fourth sub model, a fifth sub model, and a sixth sub model connected in series, wherein the described use of a pre-constructed first model to generate emotion feature information by assigning the described process-pending text information and the preamble information as an input includes:
assigning the described process-pending text information and the preamble information as an input, then extracting features by the pre-trained fourth sub model to obtain a fourth intermediate output;
assigning the described process-pending text information and the fourth intermediate output as an input, then extracting features by the pre-trained fifth sub model to obtain an emotion type and a fifth intermediate output; and
assigning the described fifth intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extracting features by the pre-trained sixth sub model to obtain the emotion feature information.
4. The deep learning-based emotional speech synthesis method of claim 2 or 3, characterized in that the pre-training process of the described second model includes:
extracting a video image sample, a text information sample, and a conversation information sample of a video sample;
labelling the described video image sample according to the pre-set emotional type to obtain an emotion labelling information sample;
training to obtain the third model by assigning the described video image sample as an input and the described emotion labelling information sample as an output, then extracting the third intermediate output of the described third model as the emotion information of the described video image sample; and
training to obtain the second model by assigning the described emotion information and the text information sample as an input, and the conversation information sample as an output.
5. The deep learning-based emotional speech synthesis method of claim 4, characterized in that the pre-training process of the described first model includes:
extracting the current text information sample and the preamble information sample of the video sample, wherein the described preamble information sample includes a preamble text information sample;
training to obtain the described first sub model by an input of the described current text information sample and the preamble information sample, and an output of whether the current text information sample has changed compared to the preamble information sample, then extracting the first intermediate output by the described first sub model;
training to obtain the described second sub model by assigning the first intermediate output and the current text information sample as an input and the emotion type as an output, then extracting the second intermediate output by the second sub model; and
training to obtain the described third sub model by an input of the second intermediate output, the current text information sample, and the emotion type or a user-desired emotion type, and an output of the emotion information obtained by the third model.
6. The deep learning-based emotional speech synthesis method of claim 4, characterized in that the pre-training process of the described first model includes:
extracting the current text information sample and the preamble information sample of the video sample, wherein the described preamble information sample includes a preamble text information sample and a preamble speech information sample;
training to obtain the described fourth sub model by an input of the described current text information sample and the preamble information sample, and an output of whether the current text information sample has changed compared to the preamble information sample, then extracting the fourth intermediate output by the described fourth sub model;
training to obtain the described fifth sub model by assigning the first intermediate output and the current text information sample as an input and the emotion type as an output, then extracting the fifth intermediate output by the fifth sub model; and
training to obtain the described sixth sub model by an input of the second intermediate output, the current text information sample, and the emotion type or a user-desired emotion type, and an output of the emotion information obtained by the third model.
7. The deep learning-based emotional speech synthesis method of claim 5 or 6, characterized in that the described pre-training process of the described second model further includes the pre-processing of the video sample, at least comprising:
dividing the described video image sample into several segmented video image samples according to a pre-set time interval, then defining the texts within any time interval as the current text information sample, and defining the texts before the mentioned time interval as the preamble text information sample.
8. A deep learning-based emotional speech synthesis device based on any of the methods in claims 1 to 7, at least comprising:
an extraction module, configured to extract text information to be processed and preamble information of the process-pending text information, wherein the preamble information includes the preamble text information;
an emotion feature information generation module, configured to use a pre-constructed first model for generating emotion feature information by assigning the process-pending text information and the preamble information as an input; and
an emotional speech synthesis module, configured to use a pre-trained second model for synthesizing emotional speech by assigning the emotion feature information and the process-pending text information as an input.
9. The deep learning-based emotional speech synthesis device of claim 8, characterized in that, when the described first model is composed of a first sub model, a second sub model, and a third sub model, the described emotion feature information generation module at least includes:
a first feature extraction unit, configured to assign the described process-pending text information and the preamble information as an input, then extract features by the pre-trained first sub model to obtain a first intermediate output;
a second feature extraction unit, configured to assign the described process-pending text information and the first intermediate output as an input, then extract features by the pre-trained second sub model to obtain an emotion type and a second intermediate output; and
a third feature extraction unit, configured to assign the described second intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extract features by the pre-trained third sub model to obtain the emotion feature information.
10. The deep learning-based emotional speech synthesis device of claim 8, characterized in that, when the described preamble information further includes preamble speech information, the described first model further includes a fourth sub model, a fifth sub model, and a sixth sub model connected in series, wherein the described emotion feature information generation module at least includes:
a fourth feature extraction module, configured to assign the described process-pending text information and the preamble information as an input, then extract features by the pre-trained fourth sub model to obtain a fourth intermediate output;
a fifth feature extraction module, configured to assign the described process-pending text information and the fourth intermediate output as an input, then extract features by the pre-trained fifth sub model to obtain an emotion type and a fifth intermediate output; and
a sixth feature extraction module, configured to assign the described fifth intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extract features by the pre-trained sixth sub model to obtain the emotion feature information.

Description

Note: Descriptions are shown in the official language in which they were submitted.


DEEP LEARNING-BASED EMOTIONAL SPEECH SYNTHESIS METHOD AND DEVICE
Technical field
The present invention relates to the field of speech synthesis technologies, in particular to a deep learning-based emotional speech synthesis method and a device.
Background
With the development of society, people wish to develop robots to substitute for some simple and repetitive manual work, such as announcing and simple customer service tasks, and humans should be able to communicate smoothly and naturally with these robots. Speech, as an important communication tool for human society, can contribute to natural and harmonious man-machine communication. Therefore, speech synthesis plays an important role in emotional computation and signal processing. Meanwhile, delicate emotional expression can greatly improve the quality of synthesized speech.

Current methods generally rely on labelled information, wherein each acoustic pronunciation in a speech is manually labelled with text and emotion. In addition, the average and variance of the emotion's fundamental frequency, the speech power, the time variance, and other parameters are set manually. In other words, after all specifications are set, segments are acquired and combined for synthesis.

The described traditional approaches are completed manually: the labelling workers are trained in the associated labelling process, and different workers tend to have different understandings of the standard labels. As a result, the mood in a sentence is understood differently, leading to non-standardized labels and large labelling errors. While the emotion matching is poor, the applicable scenarios of the labelled contents are inflexible and not diversified, and speech emotions outside the applicable range sound very mechanical and strange. Meanwhile, the data labelling requires a great amount of manual labor.
Summary
Aiming at the aforementioned technical problems, a deep learning-based emotional speech synthesis method and a device are provided in the present invention, permitting emotional speech synthesis without requiring manual labelling of text emotions.
The technical proposals provided in the present invention include:
From the first perspective, a deep learning-based emotional speech synthesis method is provided, at least comprising:
extracting text information to be processed and preamble information of the process-pending text information, wherein the preamble information includes the preamble text information;
using a pre-constructed first model to generate emotion feature information by assigning the process-pending text information and the preamble information as an input; and
using a pre-trained second model to synthesize emotional speech by assigning the emotion feature information and the process-pending text information as an input.
In some preferred embodiments, the described first model is composed of a first sub model, a second sub model, and a third sub model, wherein the described use of a pre-constructed first model to generate emotion feature information by assigning the described process-pending text information and the preamble information as an input includes:
assigning the described process-pending text information and the preamble information as an input, then extracting features by the pre-trained first sub model to obtain a first intermediate output;
assigning the described process-pending text information and the first intermediate output as an input, then extracting features by the pre-trained second sub model to obtain an emotion type and a second intermediate output; and
assigning the described second intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extracting features by the pre-trained third sub model to obtain the emotion feature information.
In some preferred embodiments, when the described preamble information further includes preamble speech information, the described first model further includes a fourth sub model, a fifth sub model, and a sixth sub model connected in series, wherein the described use of a pre-constructed first model to generate emotion feature information by assigning the described process-pending text information and the preamble information as an input includes:
assigning the described process-pending text information and the preamble information as an input, then extracting features by the pre-trained fourth sub model to obtain a fourth intermediate output;
assigning the described process-pending text information and the fourth intermediate output as an input, then extracting features by the pre-trained fifth sub model to obtain an emotion type and a fifth intermediate output; and
assigning the described fifth intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extracting features by the pre-trained sixth sub model to obtain the emotion feature information.
In some preferred embodiments, the pre-training process of the described second model includes:
extracting a video image sample, a text information sample, and a conversation information sample of a video sample;
labelling the described video image sample according to the pre-set emotional type to obtain an emotion labelling information sample;
training to obtain the third model by assigning the described video image sample as an input and the described emotion labelling information sample as an output, then extracting the third intermediate output of the described third model as the emotion information of the described video image sample; and
training to obtain the second model by assigning the described emotion information and the text information sample as an input, and the conversation information sample as an output.
In some preferred embodiments, the pre-training process of the described first model includes:
extracting the current text information sample and the preamble information sample of the video sample, wherein the described preamble information sample includes a preamble text information sample;
training to obtain the described first sub model by an input of the described current text information sample and the preamble information sample, and an output of whether the current text information sample has changed compared to the preamble information sample, then extracting the first intermediate output by the described first sub model;
training to obtain the described second sub model by assigning the first intermediate output and the current text information sample as an input and the emotion type as an output, then extracting the second intermediate output by the second sub model; and
training to obtain the described third sub model by an input of the second intermediate output, the current text information sample, and the emotion type or a user-desired emotion type, and an output of the emotion information obtained by the third model.
In some preferred embodiments, the pre-training process of the described first model includes:
extracting the current text information sample and the preamble information sample of the video sample, wherein the described preamble information sample includes a preamble text information sample and a preamble speech information sample;
training to obtain the described fourth sub model by an input of the described current text information sample and the preamble information sample, and an output of whether the current text information sample has changed compared to the preamble information sample, then extracting the fourth intermediate output by the described fourth sub model;
training to obtain the described fifth sub model by assigning the first intermediate output and the current text information sample as an input and the emotion type as an output, then extracting the fifth intermediate output by the fifth sub model; and
training to obtain the described sixth sub model by an input of the second intermediate output, the current text information sample, and the emotion type or a user-desired emotion type, and an output of the emotion information obtained by the third model.
In some preferred embodiments, the described pre-training process of the described second model further includes the pre-processing of the video sample, at least comprising:
dividing the described video image sample into several segmented video image samples according to a pre-set time interval, then defining the texts within any time interval as the current text information sample, and defining the texts before the mentioned time interval as the preamble text information sample.
From the other perspective, a deep learning-based emotional speech synthesis device on the basis of any of the described methods is further provided, at least comprising:
an extraction module, configured to extract text information to be processed and preamble information of the process-pending text information, wherein the preamble information includes the preamble text information;
an emotion feature information generation module, configured to use a pre-constructed first model for generating emotion feature information by assigning the process-pending text information and the preamble information as an input; and
an emotional speech synthesis module, configured to use a pre-trained second model for synthesizing emotional speech by assigning the emotion feature information and the process-pending text information as an input.
In some preferred embodiments, wherein the described first model is composed of a first sub model, a second sub model, and a third sub model, the described emotion feature information generation module at least includes:
a first feature extraction unit, configured to assign the described process-pending text information and the preamble information as an input, then extract features by the pre-trained first sub model to obtain a first intermediate output;
a second feature extraction unit, configured to assign the described process-pending text information and the first intermediate output as an input, then extract features by the pre-trained second sub model to obtain an emotion type and a second intermediate output; and
a third feature extraction unit, configured to assign the described second intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extract features by the pre-trained third sub model to obtain the emotion feature information.
In some preferred embodiments, when the described preamble information further includes preamble speech information, the described first model further includes a fourth sub model, a fifth sub model, and a sixth sub model connected in series, wherein the described emotion feature information generation module at least includes:
a fourth feature extraction module, configured to assign the described process-pending text information and the preamble information as an input, then extract features by the pre-trained fourth sub model to obtain a fourth intermediate output;
a fifth feature extraction module, configured to assign the described process-pending text information and the fourth intermediate output as an input, then extract features by the pre-trained fifth sub model to obtain an emotion type and a fifth intermediate output; and
a sixth feature extraction module, configured to assign the described fifth intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extract features by the pre-trained sixth sub model to obtain the emotion feature information.
In some preferred embodiments, the described device further includes a model training module. The described model training module is at least composed of a second model training unit to train the described second model. The described second model training unit at least includes:
a first extraction subunit, configured to extract a video image sample, a text information sample, and a conversation information sample of a video sample;
an emotion labeling subunit, configured to label the described video image sample according to the pre-set emotional type to obtain an emotion labelling information sample;
a first training subunit, configured to train and obtain the third model by assigning the described video image sample as an input and the described emotion labelling information sample as an output, then extract the third intermediate output of the described third model as the emotion information of the described video image sample; and
the first training subunit, further configured to train and obtain the second model by assigning the described emotion information and the text information sample as an input, and the conversation information sample as an output.
In some preferred embodiments, the described model training module is further composed of a first model training unit, configured to train the described first model, comprising:
a second extraction subunit, configured to extract the current text information sample and the preamble information sample of the video sample, wherein the described preamble information sample includes a preamble text information sample;
a second training subunit, configured to train and obtain the described first sub model by an input of the described current text information sample and the preamble information sample, and an output of whether the current text information sample has changed compared to the preamble information sample, then extract the first intermediate output by the described first sub model;
the second training subunit, further configured to train and obtain the described second sub model by assigning the first intermediate output and the current text information sample as an input and the emotion type as an output, then extract the second intermediate output by the second sub model; and
the second training subunit, further configured to train and obtain the described third sub model by an input of the second intermediate output, the current text information sample, and the emotion type or a user-desired emotion type, and an output of the emotion information by the third model.
In some preferred embodiments, the described model training module is further composed of a third model training unit, configured to train another first model, comprising:
a third extraction subunit, configured to extract the current text information sample and the preamble information sample of the video sample, wherein the described preamble information sample includes a preamble text information sample and a preamble speech information sample;
a third training subunit, configured to train and obtain the described fourth sub model by an input of the described current text information sample and the preamble information sample, and an output of whether the current text information sample has changed compared to the preamble information sample, then extract the fourth intermediate output by the described fourth sub model;
the third training subunit, further configured to train and obtain the described fifth sub model by assigning the first intermediate output and the current text information sample as an input and the emotion type as an output, then extract the fifth intermediate output by the fifth sub model; and
the third training subunit, further configured to train and obtain the described sixth sub model by an input of the second intermediate output, the current text information sample, and the emotion type or a user-desired emotion type, and an output of the emotion information by the third model.
In some preferred embodiments, the described second model training unit further includes:
a pre-processing subunit, configured to divide the described video image sample into several segmented video image samples according to a pre-set time interval, then define the texts within any time interval as the current text information sample, and define the texts before the mentioned time interval as the preamble text information sample.
The benefits provided in the present invention include the following:
The deep learning-based emotional speech synthesis method disclosed in the present invention generates emotion feature information by a pre-constructed first model based on the extracted process-pending text information and the preamble information of the described process-pending text information, and synthesizes emotional speech according to the emotion feature information and the process-pending text information based on a second model pre-trained on video samples. In the described method, only the text information needs to be acquired, and the emotional speech can be synthesized based on deep learning without manually labelling the emotion of each acoustic pronunciation in advance. Moreover, labelling errors and labor costs are further reduced, the suitability of emotion information is improved, conversational speech emotions are diversified, the naturalness and smoothness of synthesized speech are enhanced, the experience of human-machine communication is promoted, and the applicability is widened.
Besides, during the model training in the present invention, based on the video image information, the text information, and the speech information in a video, the emotion information is first obtained from the video image to construct a video-image-based speech synthesis module. Then, based on the text information, an emotional speech synthesis module targeting the emotion information is constructed, to achieve emotional speech generation based on text information. The described method is applicable to scenarios of video communications, voice communications, and communications with text information. The widened applicability of the present method can further improve the man-machine communication experience.
In addition, in the deep learning-based emotional speech synthesis method, the speech synthesis model (the second model) is constructed according to video image samples, corresponding text information samples, and conversation information samples. Therefore, the acquired emotion is more delicate, and the synthesized speech has more accurate and natural emotion and tone.
To clarify, a proposal in the present invention need only achieve one of the described technical benefits.
Brief descriptions of the drawings
For a better explanation of the technical proposals of the embodiments in the present invention, the accompanying drawings are briefly introduced in the following. Obviously, the following drawings represent only a portion of the embodiments of the present invention. Those skilled in the art are able to create other drawings according to the accompanying drawings without making creative efforts.
Fig. 1 is a flow diagram of the deep learning-based emotional speech synthesis method in embodiment one of the present invention;
Fig. 2 is a logical diagram of the deep learning-based emotional speech synthesis method in embodiment one of the present invention;
Fig. 3 is a logical diagram of training the second model in embodiment one of the present invention;
Fig. 4 is a logical diagram of training the first model in embodiment one of the present invention; and
Fig. 5 is a structure diagram of the deep learning-based emotional speech synthesis device in embodiment two of the present invention.
Detailed descriptions
In order to make the objective, the technical protocol, and the advantages of the present invention clearer, the present invention is explained in further detail below with reference to the accompanying drawings. Obviously, the embodiments described below are only a portion of the possible embodiments of the present invention and cannot represent all possible embodiments. Based on the embodiments in the present invention, other applications made by those skilled in the art without any creative work fall within the scope of the present invention.
Embodiment one
As shown in Fig. 1, a deep learning-based emotional speech synthesis method is provided in the present embodiment, related to the field of speech synthesis. Implementing the present method permits emotional speech synthesis without requiring manually labelled emotions, and effectively improves the quality of emotion in the synthesized speech.
Referring to Figs. 1 and 2, the present method comprises:
S1, extracting text information to be processed and preamble information of the process-pending text information.
In particular, where the processing target is a text object, the preamble information includes preamble text information; and where the processing target is a voice object or a video object, the preamble information includes the preamble text information and the preamble speech information.
To clarify, in the present step, different extraction tools can be used to extract the text information from a text object, the text information and the voice information from a voice object, and the text information and the voice information from a video object. The detailed extraction methods are general techniques in the field and are not explained in further detail.
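As a minimal illustration of step S1, the sketch below dispatches on the object type and returns the process-pending text together with its preamble information. The target interface (kind, current_text, preceding_text, preceding_audio) is a hypothetical assumption, since the patent does not prescribe particular extraction tools.

```python
# Hypothetical sketch of step S1: the target object's interface is assumed,
# not specified by the patent.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreambleInfo:
    text: str                       # preamble text information (always present)
    speech: Optional[bytes] = None  # preamble speech information (voice/video targets only)

def extract_inputs(target):
    """Return (process-pending text, preamble information) for a text, voice, or video target."""
    if target.kind == "text":
        return target.current_text(), PreambleInfo(text=target.preceding_text())
    # voice and video targets additionally carry the preceding speech segment
    return target.current_text(), PreambleInfo(
        text=target.preceding_text(), speech=target.preceding_audio()
    )
```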
S2, using a pre-constructed first model to generate emotion feature information by assigning the process-pending text information and the preamble information as an input.
In particular, where the processing target is a text object, step S2 comprises the following sub-steps:
S211, assigning the described process-pending text information and the preamble information as an input, then extracting features by the pre-trained first sub model to obtain a first intermediate output;
S212, assigning the described process-pending text information and the first intermediate output as an input, then extracting features by the pre-trained second sub model to obtain an emotion type and a second intermediate output; and
S213, assigning the described second intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extracting features by the pre-trained third sub model to obtain the emotion feature information.
In particular, one of the input ports of the third sub model is an emotion controlling port, which can take either the emotion type exported by the second sub model or an emotion type set manually by a user. Therefore, the acquisition of emotion types can be fully model-based; when the model output has poor reliability, the emotion type can be manually adjusted to further improve the accuracy and reliability of the acquired emotion information.
In particular, the first intermediate output is the output feature vector from the layer before the logical judgement layer of the first sub model, including the current conversation tone and the emotion characteristics of the current text extracted by the first sub model. The second intermediate output is the output feature vector from the layer before the classification layer of the second sub model, including the emotion characteristics of the current text extracted by the second sub model according to the first intermediate output.
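A minimal PyTorch sketch of the S211-S213 cascade follows. The GRU encoders, the dimensions, and the one-hot emotion conditioning are simplifying assumptions made only to show the wiring; the patent's sub models are built on Transformer-XL, Transformer, and StarGAN variants, as described in the training section below.

```python
# Illustrative stand-ins for the first, second, and third sub models (S211-S213).
# Only the input/output wiring, the two intermediate outputs, and the emotion
# controlling port follow the text; everything else is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB, HID, N_EMOTIONS, FEAT = 256, 256, 8, 128

class FirstSubModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.GRU(EMB, HID, batch_first=True)
        self.judge = nn.Linear(2 * HID, 2)               # logical judgement layer
    def forward(self, text_emb, preamble_emb):
        _, h_t = self.enc(text_emb)
        _, h_p = self.enc(preamble_emb)
        inter1 = torch.cat([h_t[-1], h_p[-1]], dim=-1)   # first intermediate output
        return self.judge(inter1), inter1

class SecondSubModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.GRU(EMB, HID, batch_first=True)
        self.cls = nn.Linear(3 * HID, N_EMOTIONS)        # classification layer
    def forward(self, text_emb, inter1):
        _, h_t = self.enc(text_emb)
        inter2 = torch.cat([h_t[-1], inter1], dim=-1)    # second intermediate output
        return self.cls(inter2), inter2

class ThirdSubModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.GRU(EMB, HID, batch_first=True)
        self.proj = nn.Linear(HID + 3 * HID + N_EMOTIONS, FEAT)
    def forward(self, text_emb, inter2, emotion_onehot):
        _, h_t = self.enc(text_emb)
        return self.proj(torch.cat([h_t[-1], inter2, emotion_onehot], dim=-1))

def first_model(text_emb, preamble_emb, m1, m2, m3, user_emotion=None):
    """S211-S213: return the emotion feature information for the text-only path."""
    _, inter1 = m1(text_emb, preamble_emb)
    type_logits, inter2 = m2(text_emb, inter1)
    # emotion controlling port: use the predicted type unless the user sets one
    emotion = user_emotion if user_emotion is not None else type_logits.argmax(dim=-1)
    onehot = F.one_hot(emotion, N_EMOTIONS).float()      # user_emotion is assumed a long tensor
    return m3(text_emb, inter2, onehot)
```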
As another preferred embodiment, when the processing target is a voice object or a video object, step S2 includes the following sub-steps:
S221, assigning the described process-pending text information and the preamble information as an input, then extracting features by the pre-trained fourth sub model to obtain a fourth intermediate output;
S222, assigning the described process-pending text information and the fourth intermediate output as an input, then extracting features by the pre-trained fifth sub model to obtain an emotion type and a fifth intermediate output; and
S223, assigning the described fifth intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extracting features by the pre-trained sixth sub model to obtain the emotion feature information.
In particular, the fourth intermediate output is the output feature vector from the layer before the logical judgement layer of the fourth sub model, including the current conversation tone and the emotion characteristics of the current text extracted by the fourth sub model according to the input conversation speech or the input video. The fifth intermediate output is the output feature vector from the layer before the classification layer of the fifth sub model, including the emotion characteristics of the current text extracted by the fifth sub model according to the fourth intermediate output.
It has been verified that, because the preamble information here includes both the preamble text information and the preamble speech information, the acquired emotional speech feature information is more reliable.
S3, using a pre-trained second model to synthesize emotional speech by assigning the emotion feature information and the process-pending text information as an input.
Based on the aforementioned step S2, natural-tone emotional speech can be generated on the basis of the text information.
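To show how S1-S3 fit together, here is a hypothetical end-to-end call that feeds the emotion feature information and the text into a stand-in second model. The patent's second model is Tacotron2-based, so the simple GRU decoder below is only a placeholder for the synthesizer interface, a vocoder stage would still follow, and the `embed` text embedder is an assumed helper.

```python
# Hypothetical end-to-end S1-S3 pipeline, reusing extract_inputs and first_model
# from the sketches above. SecondModel only illustrates the wiring; the real
# second model in the patent is built on Tacotron2.
import torch
import torch.nn as nn

class SecondModel(nn.Module):
    def __init__(self, emb=256, feat=128, n_mels=80):
        super().__init__()
        self.dec = nn.GRU(emb + feat, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)
    def forward(self, text_emb, emotion_feat):
        # broadcast the utterance-level emotion feature over the text frames
        cond = emotion_feat.unsqueeze(1).expand(-1, text_emb.size(1), -1)
        out, _ = self.dec(torch.cat([text_emb, cond], dim=-1))
        return self.to_mel(out)          # mel-spectrogram frames; a vocoder would follow

def synthesize(target, embed, m1, m2, m3, second_model, user_emotion=None):
    text, preamble = extract_inputs(target)                      # S1
    text_emb, preamble_emb = embed(text), embed(preamble.text)   # hypothetical text embedder
    emotion_feat = first_model(text_emb, preamble_emb, m1, m2, m3, user_emotion)  # S2 (text path)
    return second_model(text_emb, emotion_feat)                  # S3
```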
Therefore, the deep learning-based emotional speech synthesis method disclosed in the present invention generates emotion feature information by a pre-constructed first model based on the extracted process-pending text information and the preamble information of the described process-pending text information, and synthesizes emotional speech according to the emotion feature information and the process-pending text information based on a second model pre-trained on video samples. In the described method, only the text information needs to be acquired, and the emotional speech can be synthesized based on deep learning without manually labelling the emotion of each acoustic pronunciation in advance. Moreover, labelling errors and labor costs are further reduced, the suitability of emotion information is improved, conversational speech emotions are diversified, the naturalness and smoothness of synthesized speech are enhanced, and the experience of human-machine communication is promoted.
Besides, when the present method is implemented to synthesize emotional speech, the processing target can be text only, or a combination of text and voice. Therefore, the present method is applicable to any of text, voice, and video for emotional speech synthesis, with wide applicability.
Furthermore, the present method further includes model training steps to pre-train the first model and the second model. First of all, the pre-training process of the described second model includes:
Sa1, extracting a video image sample, a text information sample, and a conversation information sample of a video sample;
Sa2, labelling the described video image sample according to the pre-set emotional type to obtain an emotion labelling information sample; and
Sa3, training to obtain the third model by assigning the described video image sample as an input and the described emotion labelling information sample as an output, then extracting the third intermediate output of the described third model as the emotion information of the described video image sample; and training to obtain the second model by assigning the described emotion information and the text information sample as an input, and the conversation information sample as an output.
Illustratively, the third model is constructed based on ResNet-50, carrying the cross-entropy loss function. The second model is constructed based on Tacotron2, carrying the mean square error loss function and the L2 distance loss function.
In particular, as shown in Fig. 3, the third model is connected with the second model in series and the two models are trained together. After extracting the video image samples, text information samples, and conversation information samples, the video image samples are sent to the third model input port (I3), and the third intermediate output (O31) is sent to an input port of the second model (I51). The second model uses the text information samples as its other input (I52). The third model and the second model take the emotion labelling information (O32) and the conversation information samples (O5) as their respective targets, to train both models together. The third intermediate output (O31) is intercepted as the emotion information.
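A minimal sketch of one such joint training step follows. It assumes `third_model` returns both emotion logits and the intercepted intermediate feature, and that `second_model` maps text plus that feature to a mel target; the cross-entropy and mean square error terms follow the loss functions named above, while data loading and the L2 distance term are left out for brevity.

```python
# Sketch of one joint training step for the third model (video frames -> emotion
# label, intermediate feature intercepted) and the second model (text + emotion
# feature -> conversation speech). Models and batch layout are assumptions.
import torch
import torch.nn.functional as F

def joint_train_step(third_model, second_model, optimizer, batch):
    frames, text_emb, emotion_label, target_mel = batch
    emotion_logits, emotion_info = third_model(frames)      # I3 -> O32 (logits), O31 (feature)
    pred_mel = second_model(text_emb, emotion_info)          # I51, I52 -> O5
    loss = (F.cross_entropy(emotion_logits, emotion_label)   # third-model target: emotion labels
            + F.mse_loss(pred_mel, target_mel))              # second-model target: conversation speech
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```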
After completion of the second model, the first model is trained. The first model can be constructed according to the applied objects. For example, the model applied to text-based contents is different from the model for voice- or video-based contents. After receiving the process-pending objects, the system automatically determines the object type and selects an appropriate first model.
When the first model is only applicable to text-based contents, the training process of the first model includes:
Sb1, extracting the current text information sample and the preamble information sample of the video sample, wherein the described preamble information sample includes a preamble text information sample;
Sb2, training to obtain the described first sub model by an input of the described current text information sample and the preamble information sample, and an output of whether the current text information sample has changed compared to the preamble information sample, then extracting the first intermediate output by the described first sub model;
Sb3, training to obtain the described second sub model by assigning the first intermediate output and the current text information sample as an input and the emotion type as an output, then extracting the second intermediate output by the second sub model; and
Sb4, training to obtain the described third sub model by an input of the second intermediate output, the current text information sample, and the emotion type or a user-desired emotion type, and an output of the emotion information by the third model.
In detail, as shown in Fig. 4, the first sub model, the second sub model, and the third sub model are connected in series. After the preamble information samples and the preamble text information samples are acquired, the described three sub models are trained together.
Illustratively, the first sub model is constructed based on Transformer-XL, wherein an LSTM+CNN structure replaces the Decoder part and is defined as the logical judgement of the first sub model to generate outputs. The outputs carry the cross-entropy loss function. The second sub model is constructed based on Transformer, wherein an LSTM+CNN structure replaces the Decoder part and is defined as the logical judgement of the second sub model to generate outputs. The outputs carry the cross-entropy loss function. The third sub model is constructed based on StarGAN, wherein a Conv1D network layer replaces the Conv2D in the structure. The output carries the mean square error loss function and the L2 distance loss function.
The preamble information samples and the current text information samples are defined as the two inputs of the first model (I11, I12). In particular, the current text information samples are assigned as one input of each sub model (I11, I21, I42), and the preamble information samples are the other input (I12) of the first sub model; the output (O12) is whether the emotion of the current text information sample has changed compared to the emotion of the preamble information sample. The first intermediate output (O11) is assigned as the other input of the second sub model (I21), and the output of the second sub model is the emotion type (O22). The second intermediate output is assigned as the other input of the third sub model (I41), and the emotion information acquired by the third model is assigned as the output (O4). The aforementioned three sub models are trained together.
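The following sketch wires that joint training step together, reusing the hypothetical FirstSubModel/SecondSubModel/ThirdSubModel stand-ins from the earlier sketch. The targets (change flag, emotion type, emotion information from the third model) and the cross-entropy and mean square error terms follow the text; the L2 distance term and data preparation are omitted, and the emotion-information target is assumed to have already been projected to the third sub model's output dimension.

```python
# Sketch of one joint training step for the three sub models of the text-only
# first model (Fig. 4 wiring), reusing the stand-in sub models defined earlier.
import torch
import torch.nn.functional as F

def first_model_train_step(m1, m2, m3, optimizer, batch):
    text_emb, preamble_emb, changed, emotion_type, emotion_info_target = batch
    change_logits, inter1 = m1(text_emb, preamble_emb)       # I11, I12 -> O12 (changed?), O11
    type_logits, inter2 = m2(text_emb, inter1)               # I21 -> O22 (emotion type)
    onehot = F.one_hot(emotion_type, N_EMOTIONS).float()
    emotion_feat = m3(text_emb, inter2, onehot)              # I41, I42 -> O4 (emotion information)
    loss = (F.cross_entropy(change_logits, changed)
            + F.cross_entropy(type_logits, emotion_type)
            + F.mse_loss(emotion_feat, emotion_info_target))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```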
When a first model applicable to voice or video contents is trained, the training process includes:
Sc1, extracting the current text information sample and the preamble information sample of the video sample, wherein the described preamble information sample includes a preamble text information sample and a preamble speech information sample;
Sc2, training to obtain the described fourth sub model by an input of the described current text information sample and the preamble information sample, and an output of whether the current text information sample has changed compared to the preamble information sample, then extracting the fourth intermediate output by the described fourth sub model;
Sc3, training to obtain the described fifth sub model by assigning the first intermediate output and the current text information sample as an input and the emotion type as an output, then extracting the fifth intermediate output by the fifth sub model; and
Sc4, training to obtain the described sixth sub model by an input of the second intermediate output, the current text information sample, and the emotion type or a user-desired emotion type, and an output of the emotion information by the third model.
Illustratively, the fourth sub model is constructed based on ResNet-50 and Transformer-XL, wherein the dense layer of ResNet-50 is discarded and a ConvLSTM2D structural network layer replaces the Conv2D in ResNet-50. The pooling layer output of the ResNet-50 is merged with the output of the Encoder in Transformer-XL. An LSTM+CNN structure replaces the Decoder part and is defined as the logical judgement of the fourth sub model to generate outputs, carrying the cross-entropy loss function. The fifth sub model is constructed based on Transformer, wherein an LSTM+CNN structure replaces the Decoder part and is defined as the logical judgement of the fifth sub model to generate outputs. The outputs carry the cross-entropy loss function. The sixth sub model is constructed based on StarGAN, wherein a Conv1D structure network layer replaces the Conv2D structure network layer. The output carries the mean square error loss function and the L2 distance loss function.
The two training methods of the first model are the same, wherein the relevant input and output relationships can refer to the first training method. The only difference is that the preamble speech information samples are added to the fourth sub model in the second training method.
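The core structural idea here, merging a pooled convolutional feature with a text encoder output before the judgement head, can be sketched as follows. The plain Conv2d stack and the stock TransformerEncoder are simplified stand-ins for the modified ResNet-50 (with ConvLSTM2D) and Transformer-XL described above, and the sizes are illustrative assumptions.

```python
# Simplified stand-in for the fourth sub model: a pooled convolutional branch
# (in place of the modified ResNet-50) is concatenated with a text encoder
# output (in place of the Transformer-XL encoder) before an LSTM judgement head.
import torch
import torch.nn as nn

class FourthSubModel(nn.Module):
    def __init__(self, emb=256, hid=256):
        super().__init__()
        self.visual = nn.Sequential(                        # stand-in for the image/speech branch
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # "pooling layer output"
        )
        layer = nn.TransformerEncoderLayer(d_model=emb, nhead=4, batch_first=True)
        self.text_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.LSTM(emb + 64, hid, batch_first=True)  # LSTM part of the LSTM+CNN head
        self.judge = nn.Linear(hid, 2)                         # emotion changed vs. unchanged

    def forward(self, frames, text_emb):
        v = self.visual(frames)                             # (B, 64) pooled branch feature
        t = self.text_enc(text_emb)                         # (B, T, emb) encoded text
        merged = torch.cat([t, v.unsqueeze(1).expand(-1, t.size(1), -1)], dim=-1)
        out, _ = self.head(merged)
        inter4 = out[:, -1]                                 # fourth intermediate output
        return self.judge(inter4), inter4
```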
Therefore, during the model training in the present invention, based on the video image information, the text information, and the speech information in a video, the emotion information is first obtained from the video image to construct a video-image-based speech synthesis module. Then, based on the text information, an emotional speech synthesis module targeting the emotion information is constructed, to achieve emotional speech generation based on text information. The described method is applicable to scenarios of video communications, voice communications, and communications with text information. The widened applicability of the present method can further improve the man-machine communication experience.
In addition, in the deep learning-based emotional speech synthesis method, the speech synthesis model (the second model) is constructed according to video image samples, corresponding text information samples, and conversation information samples. Therefore, the acquired emotion is more delicate, and the synthesized speech has more accurate and natural emotion and tone.
Embodiment two
To implement the deep learning-based emotional speech synthesis method in embodiment one, a deep learning-based emotional speech synthesis device 100 is provided in the present embodiment.
Fig. 5 is a structure diagram of the deep learning-based emotional speech synthesis device.
As shown in Fig. 5, the device 100 at least comprises:
an extraction module 1, configured to extract text information to be processed and preamble information of the process-pending text information, wherein the preamble information includes the preamble text information;
an emotion feature information generation module 2, configured to use a pre-constructed first model for generating emotion feature information by assigning the process-pending text information and the preamble information as an input; and
an emotional speech synthesis module 3, configured to use a pre-trained second model for synthesizing emotional speech by assigning the emotion feature information and the process-pending text information as an input.
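As an illustration of how the three modules could be composed, a minimal sketch of device 100 follows; each module is assumed to be a callable with the hypothetical interfaces sketched in embodiment one, not a prescribed implementation.

```python
# Hypothetical composition of device 100 from its three modules.
class EmotionalSpeechSynthesisDevice:
    def __init__(self, extraction_module, emotion_module, synthesis_module):
        self.extract = extraction_module    # extraction module 1
        self.gen_emotion = emotion_module   # emotion feature information generation module 2
        self.synth = synthesis_module       # emotional speech synthesis module 3

    def run(self, target, user_emotion=None):
        text_emb, preamble_emb = self.extract(target)                          # S1
        emotion_feat = self.gen_emotion(text_emb, preamble_emb, user_emotion)  # S2
        return self.synth(text_emb, emotion_feat)                              # S3
```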
In some preferred embodiments, wherein the described first model is composed of a first sub model, a second sub model, and a third sub model, the described emotion feature information generation module 2 at least includes:
a first feature extraction unit 21, configured to assign the described process-pending text information and the preamble information as an input, then extract features by the pre-trained first sub model to obtain a first intermediate output;
a second feature extraction unit 22, configured to assign the described process-pending text information and the first intermediate output as an input, then extract features by the pre-trained second sub model to obtain an emotion type and a second intermediate output; and
a third feature extraction unit 23, configured to assign the described second intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extract features by the pre-trained third sub model to obtain the emotion feature information.
In some preferred embodiments, when the described preamble information further includes preamble speech information, the described first model further includes a fourth sub model, a fifth sub model, and a sixth sub model connected in series, wherein the described emotion feature information generation module 2 at least includes:
a fourth feature extraction module 21', configured to assign the described process-pending text information and the preamble information as an input, then extract features by the pre-trained fourth sub model to obtain a fourth intermediate output;
a fifth feature extraction module 22', configured to assign the described process-pending text information and the fourth intermediate output as an input, then extract features by the pre-trained fifth sub model to obtain an emotion type and a fifth intermediate output; and
a sixth feature extraction module 23', configured to assign the described fifth intermediate output, process-pending text information, and the emotion type or a user-desired emotion type as an input, then extract features by the pre-trained sixth sub model to obtain the emotion feature information.
In some preferred embodiments, the described device further includes a model training module 4. The described model training module 4 is at least composed of a second model training unit 41 to train the described second model. The described second model training unit 41 at least includes:
a first extraction subunit 411, configured to extract a video image sample, a text information sample, and a conversation information sample of a video sample;
an emotion labeling subunit 412, configured to label the described video image sample according to the pre-set emotional type to obtain an emotion labelling information sample; and
a first training subunit 413, configured to train and obtain the third model by assigning the described video image sample as an input and the described emotion labelling information sample as an output, then extract the third intermediate output of the described third model as the emotion information of the described video image sample; and further configured to train and obtain the second model by assigning the described emotion information and the text information sample as an input, and the conversation information sample as an output.
In some preferred embodiments, the described model training module is further composed of a first model training unit 42, configured to train the described first model, comprising:
a second extraction subunit 421, configured to extract the current text information sample and the preamble information sample of the video sample, wherein the described preamble information sample includes a preamble text information sample;
a second training subunit 422, configured to train and obtain the described first sub model by an input of the described current text information sample and the preamble information sample, and an output of whether the current text information sample has changed compared to the preamble information sample, then extract the first intermediate output by the described first sub model;
the second training subunit, further configured to train and obtain the described second sub model by assigning the first intermediate output and the current text information sample as an input and the emotion type as an output, then extract the second intermediate output by the second sub model; and
the second training subunit, further configured to train and obtain the described third sub model by an input of the second intermediate output, the current text information sample, and the emotion type or a user-desired emotion type, and an output of the emotion information by the third model.
In some preferred embodiments, the described model training module is further composed of a third model training unit 43, configured to train another first model, comprising:
a third extraction subunit 431, configured to extract the current text information sample and the preamble information sample of the video sample, wherein the described preamble information sample includes a preamble text information sample and a preamble speech information sample;
a third training subunit 432, configured to train and obtain the described fourth sub model by an input of the described current text information sample and the preamble information sample, and an output of whether the current text information sample has changed compared to the preamble information sample, then extract the fourth intermediate output by the described fourth sub model;
the third training subunit, further configured to train and obtain the described fifth sub model by assigning the first intermediate output and the current text information sample as an input and the emotion type as an output, then extract the fifth intermediate output by the fifth sub model; and
the third training subunit, further configured to train and obtain the described sixth sub model by an input of the second intermediate output, the current text information sample, and the emotion type or a user-desired emotion type, and an output of the emotion information by the third model.
In some preferred embodiments, the described second model training unit 41 further includes:
a pre-processing subunit 414, configured to divide the described video image sample into several segmented video image samples according to a pre-set time interval, then define the texts within any time interval as the current text information sample, and define the texts before the mentioned time interval as the preamble text information sample.
To clarify, when the deep learning-based emotional speech synthesis service is invoked in the deep learning-based emotional speech synthesis device in the aforementioned embodiments, the described functional module configuration is used for illustration only. In practical applications, the described functions can be assigned to different functional modules according to practical demands, wherein the internal structure of the device is divided into different functional modules to perform all or a portion of the described functions. Besides, the aforementioned deep learning-based emotional speech synthesis device in the embodiment adopts the same concept as the described deep learning-based emotional speech synthesis method embodiments. The described device is based on the implementation of the deep learning-based emotional speech synthesis method, and the detailed procedures can be found in the method embodiments and are not explained in further detail.
Those skilled in the art can understand that all or a portion of the aforementioned embodiments can be achieved by hardware, or by hardware driven by programs stored on a computer-readable storage medium. The aforementioned storage medium can be, but is not limited to, a memory, a diskette, or a disc.
The aforementioned technical proposals can be achieved by any combination of the embodiments in the present invention. In other words, the embodiments can be combined to meet the requirements of different application scenarios, wherein all possible combinations fall within the scope of the present invention and are not explained in further detail.

Representative Drawing
A single figure which represents the drawing illustrating the invention.

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer, as well as the definitions for Patent, Administrative Status, Maintenance Fee and Payment History, should be consulted.

Administrative Status

Title Date
Forecasted Issue Date Unavailable
(86) PCT Filing Date 2020-06-19
(87) PCT Publication Date 2021-03-18
(85) National Entry 2022-03-10
Examination Requested 2022-09-16

Abandonment History

There is no abandonment history.

Maintenance Fee

Last Payment of $100.00 was received on 2023-12-15


 Upcoming maintenance fee amounts

Description Date Amount
Next Payment if small entity fee 2025-06-19 $100.00
Next Payment if standard fee 2025-06-19 $277.00

Note : If the full payment has not been received on or before the date indicated, a further fee may be required which may be one of the following

  • the reinstatement fee;
  • the late payment fee; or
  • additional fee to reverse deemed expiry.

Patent fees are adjusted on the 1st of January every year. The amounts above are the current amounts if received by December 31 of the current year.
Please refer to the CIPO Patent Fees web page to see all current fee amounts.

Payment History

Fee Type Anniversary Year Due Date Amount Paid Paid Date
Application Fee 2022-03-10 $407.18 2022-03-10
Maintenance Fee - Application - New Act 2 2022-06-20 $100.00 2022-03-10
Request for Examination 2024-06-19 $814.37 2022-09-16
Maintenance Fee - Application - New Act 3 2023-06-19 $100.00 2022-12-15
Maintenance Fee - Application - New Act 4 2024-06-19 $100.00 2023-12-15
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
10353744 CANADA LTD.
Past Owners on Record
None
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents


List of published and non-published patent-specific documents on the CPD .

If you have any difficulty accessing content, you can call the Client Service Centre at 1-866-997-1936 or send them an e-mail at CIPO Client Service Centre.


Document Description   Date (yyyy-mm-dd)   Number of pages   Size of Image (KB)
Abstract 2022-03-10 1 28
Claims 2022-03-10 5 237
Drawings 2022-03-10 4 71
Description 2022-03-10 19 1,053
International Search Report 2022-03-10 4 146
Amendment - Abstract 2022-03-10 2 106
National Entry Request 2022-03-10 14 1,314
Representative Drawing 2022-06-08 1 14
Cover Page 2022-06-08 1 53
Request for Examination 2022-09-16 8 296
Correspondence for the PAPS 2022-12-23 4 149
Examiner Requisition 2023-12-12 4 216
Amendment 2024-04-12 107 10,421
Claims 2024-04-12 46 3,002