Patent 3151910 Summary

(12) Patent Application: (11) CA 3151910
(54) English Title: DEVELOPMENT OF VOICE AND OTHER INTERACTION APPLICATIONS
(54) French Title: DEVELOPPEMENT D'APPLICATIONS VOCALES ET D'AUTRES APPLICATIONS D'INTERACTION
Status: Compliant
Bibliographic Data
(51) International Patent Classification (IPC):
  • G10L 13/04 (2013.01)
  • G10L 15/20 (2006.01)
  • H04M 3/00 (2006.01)
(72) Inventors :
  • MCMAHON, JEFFREY K. (United States of America)
  • NAUGHTON, ROBERT T. (United States of America)
  • LAIDLAW, NICHOLAS G. (United States of America)
  • DUNN, ALEXANDER M. (United States of America)
  • GREEN, JASON (United States of America)
(73) Owners :
  • VOICIFY, LLC (United States of America)
(71) Applicants :
  • VOICIFY, LLC (United States of America)
(74) Agent: SMART & BIGGAR LP
(74) Associate agent:
(45) Issued:
(86) PCT Filing Date: 2020-08-13
(87) Open to Public Inspection: 2021-02-25
Availability of licence: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): Yes
(86) PCT Filing Number: PCT/US2020/046201
(87) International Publication Number: WO2021/034613
(85) National Entry: 2022-02-18

(30) Application Priority Data:
Application No. Country/Territory Date
16/544,375 United States of America 2019-08-19
16/544,508 United States of America 2019-08-19
16/544,527 United States of America 2019-08-19
16/816,535 United States of America 2020-03-12

Abstracts

English Abstract

Among other things, a developer of an interaction application for an enterprise can create items of content to be provided to an assistant platform for use in responses to requests of end-users. The developer can deploy the interaction application using defined items of content and an available general interaction model including intents and sample utterances having slots. The developer can deploy the interaction application without requiring the developer to formulate any of the intents, sample utterances, or slots of the general interaction model.


French Abstract

Entre autres, un développeur d'une application d'interaction pour une entreprise peut créer des éléments de contenu qui seront fournis à une plateforme d'assistant destinée à être utilisée dans des réponses à des demandes d'utilisateurs finaux. Le développeur peut déployer l'application d'interaction à l'aide d'éléments de contenu définis et d'un modèle d'interaction général disponible qui comprend des intentions et des énoncés échantillons avec des intervalles. Le développeur peut déployer l'application d'interaction sans que le développeur ait besoin de formuler des intentions, des énoncés échantillons ou des intervalles du modèle d'interaction général.

Claims

Note: Claims are shown in the official language in which they were submitted.


Claims:
1. A machine-implemented method comprising
enabling a developer of an interaction application for an enterprise to create
items of
content to be provided to an assistant platform for use in responses to
requests of end-users, and
enabling the developer to deploy the interaction application using the defined
items of
content and an available general interaction model comprising intents and
sample utterances
having slots, without requiring the developer to formulate any of the intents,
sample utterances,
or slots of the general interaction model.
2. The method of claim 1 in which the enabling of the developer to create
items of content
comprises presenting a graphical user interface enabling the developer to
create the items of
content without requiring the developer to engage in coding.
3. The method of claim 1 in which the intents of the general interaction
model comprise
abstract intents that cannot be mapped directly to specific content without
parsing and
interpretation of slots associated with corresponding sample utterances.
4. The method of claim 1 in which the sample utterances comprise abstract
characterizations
of requests that cannot be mapped directly to specific content without parsing
and interpretation
of slots of the sample utterances.
5. The method of claim 1 in which one or more of the slots comprises an
open-ended slot
that requires parsing and interpretation in order to determine an appropriate
item of content
corresponding to an utterance of an end user.
6. The method of claim 1 in which the enterprise belongs to a particular
vertical market and
the method comprises enabling the developer to select a template configured
for developing
interaction applications for enterprises belonging to the particular vertical market.
7. The method of claim 1 comprising enabling the developer to indicate one
or more
particular sample utterances of an end user for each intent.
8. The method of claim 1 comprising enabling the developer to customize the
general
interaction model by adding an intent or a sample utterance pattern or both.
9. The method of claim 1 in which the enabling of the developer to deploy
the interaction
application comprises enabling the developer to deploy the interaction
application for use with
two or more different assistant platforms without requiring any action by the
developer to
configure the interaction application for use with two or more different
assistant platforms.
10. An apparatus comprising
a processor, and
a memory storing instructions executable by the processor to
receive utterances including slots, the utterances having been derived by an
assistant platform from requests of end-users of interaction assistants,
apply each of the received utterances to a general interaction model to
determine
intents, the general interaction model comprising non-specific sample
utterances including open-
ended slots, and
forward the intents to an interaction application configured to find items
of
content for use in providing responses to the requests of the end users.
11. The apparatus of claim 10 in which the non-specific sample utterances
cannot be used
directly to find items of content without parsing and interpreting the open-
ended slots.
12. The apparatus of claim 10 in which the open-ended slots comprise
extended portions of
utterances of end-users including parameters having values and text elements
representing
context to be parsed and interpreted.
13. The apparatus of claim 10 in which the intents comprise non-specific
intents.
14. A machine-based method comprising
storing content configured to be used by two different interaction
applications in
generating responses to requests from users of interaction assistants, and
executing the two different interaction applications to respond to intents and
slot
information received from assistant platforms based on the requests from users
of interaction
assistants, the intents and slot information having been generated by
application of a single
general interaction model to the requests from the users.
15. The method of claim 14 in which execution of the two different
interaction applications
invokes the respective stored content.
16. The method of claim 14 in which the two different interaction
applications are associated
with a single enterprise.
17. The method of claim 14 in which the two different interaction
applications are associated
with two different enterprises belonging to a single vertical market.
18. The method of claim 14 in which the two different interaction
applications are associated
with two different enterprises belonging to two different vertical markets.
19. The method of claim 14 comprising executing at least one of the
different interaction
applications to respond to intents and slot information received from two
different assistant
platforms.
20. The method of claim 14 in which the single general interaction model
comprises non-
specific intents.
21. The method of claim 14 in which the single general interaction model
comprises open-
ended slots.
22. The method of claim 14 comprising updating the stored content without
changing the
single general interaction model.
23. A machine-based method comprising
expressing markup elements of a speech markup language string as a tree of
nodes, each
of the nodes corresponding to one of the markup elements of the string,
segmenting the tree of nodes into branches, the branches having the same first
node,
processing the branches,
re-expressing the speech markup language string based on results of the
processing of the
branches, and
storing the re-expressed speech markup language string for use in responding
to requests
of end-users of interaction assistants.
24. The method of claim 23 in which the speech markup language string is
expressed in
accordance with SSML.
25. The method of claim 23 in which the markup elements comprise tags of a
speech markup
language.
26. The method of claim 23 in which segmenting the tree of nodes into
branches comprises
identifying branches that may not be usable by an interaction assistant
platform that applies a
version of a speech markup language according to which the speech markup
language string is
expressed.
27. The method of claim 26 in which the branches may not be usable because
they contain
nodes that are not valid elements of the version of the speech markup language
applied by the
interaction assistant platform.
28. The method of claim 23 in which re-expressing the speech markup
language string based
on results of the processing of the branches comprises removing invalid nodes
of branches and
merging the branches including the branches from which the invalid nodes have been removed.
29. The method of claim 23 comprising providing the re-expressed speech
markup language
string to an interaction assistant platform for use in a text to speech
presentation of a response to
an end user.
30. A machine-based method comprising
storing rules representing a particular version of a speech markup language
definition
applied by an interaction assistant platform,
applying the rules representing the particular version to validate a speech
markup
language string to be used in responses to requests of end-users of
interaction assistants
conforming to the particular version.
31. The method of claim 30 comprising storing rules representing a second
particular version
of the speech markup language definition applied by a second interaction
assistant platform, and
applying the rules representing the second particular version to validate a
speech markup
language string to be used in responses to requests of end-users of interaction
assistants
conforming to the second particular version.
32. The method of claim 30 in which the speech markup language definition
comprises
SSML.
33. The method of claim 30 comprising providing the validated speech markup
language
string to the interaction assistant platform for use in presenting responses
to requests of end-
users.
34. The method of claim 30 in which applying the rules to validate the
speech markup
language string comprises expressing markup elements of a speech markup
language string as a
tree of nodes.
35. A machine-based method comprising
presenting a user interface enabling a developer to create speech markup
language strings
conforming to a speech markup language definition applied by a corresponding
interaction
assistant platform,
the user interface enabling the user to create markup language strings using
plain text and
graphical elements and without requiring the user to select or enter any
formal expressions of
markup elements of the speech markup language definition.
36. The method of claim 35 in which the user interface presents controls
for entering text to
be spoken to an end user by an interaction assistant.
37. The method of claim 35 in which the user interface presents controls
corresponding to
elements of the speech markup language strings associated with effects to be
applied or added to
one or more words of text to be spoken to an end user by an interaction
assistant.
38. The method of claim 35 in which the user interface presents controls
corresponding to
properties of elements of the speech markup language strings.
39. The method of claim 35 in which the user interface presents controls
corresponding to
selectable values of properties of elements of the speech markup language
strings.
40. The method of claim 35 in which the user interface presents controls
comprising icons
graphically representative of effects to be applied to one or more words of
text to be spoken to an
end user by an interaction assistant, properties of the effects, or values of
properties of the
effects.
41. The method of claim 35 in which the user interface displays graphical
indicators in line
with text words, the graphical indicators representing effects to be applied
to one or more of the
text words when the words are spoken to an end user by an interaction
assistant.
42. The method of claim 41 in which the graphical indicators comprise
backgrounds
displayed with the words to which the corresponding effects are to be applied.

43. The method of claim 42 in which the backgrounds are color-coded
according to the
effects to which they correspond.
44. The method of claim 42 in which effects to be applied to one or more
words can be
nested and the backgrounds are nested in accordance with the nesting of the
effects.
45. A machine-based method comprising
displaying to a user of a user interface a representation of a speech markup
language
string including plain text and graphical indicators of markup language
elements,
each of the graphical indicators being displayed in graphical association with
a portion of
the speech markup language string to which the corresponding one of the markup
language
elements applies.
46. The method of claim 45 in which the graphical indicators are displayed
in line with the
plain text.
47. The method of claim 45 in which the graphical indicators represent
audio effects to be
applied to one or more words of the plain text.
48. The method of claim 45 in which the graphical indicators represent
properties of audio
effects to be applied to one or more words of the plain text.
49. The method of claim 45 in which the graphical indicators comprise
backgrounds
displayed with words of the plain text to which the corresponding effects are
to be applied.
50. The method of claim 49 in which the backgrounds are color-coded
according to the
effects to which they correspond.
51. The method of claim 49 in which effects to be applied to one or more
words of the plain
text can be nested and the backgrounds are nested in accordance with the
nesting of the effects.
52. The method of claim 45 in which the user interface displays controls
enabling a
developer to display the raw speech markup language strings and to display
the plain text and
graphical indicators representative of the effects to be applied to the text
in line.
53. A machine-based method comprising
receiving intents and slot information from an assistant platform based on
requests of
end-users of interaction assistants, the intents and slot information having
been derived by
natural language processing,
applying additional natural language processing to the intents and slot
information
received from the assistant platform, and
deriving additional information about the requests of the end users based on
the
additional natural language processing.
54. A machine-based method comprising
receiving utterances representing requests of end users of interaction
assistants,
comparing the received utterances with non-specific sample utterances of a
general
interaction model, and
accepting a variety of received utterances as matches for a given non-specific
sample
utterance.
55. The method of claim 54 comprising processing slot information of the
received
utterances to identify content to be used in responses to the requests.
56. The method of claim 55 comprising different items of content each of
which corresponds
only to one of the received utterances of the variety of received utterances
that are accepted as
matches.
57. A machine-based method comprising
presenting a user interface enabling a developer of an interaction application
to select
general utterance patterns for inclusion in the interaction application, each
of the general
utterance patterns spanning a set of one or more sample utterances that
correspond to the general
utterance pattern, the user interface exposing a set of available general
utterance patterns,
automatically applying machine learning techniques to stored sample
utterances, stored
general utterance patterns, or sample utterances proposed by developers of
interaction
applications, to identify additional general utterance patterns, and
including the additional general utterance patterns in the set of available
general utterance
patterns exposed by the user interface.
58. The machine-based method of claim 57 comprising
matching proposed sample utterances of developers with stored sample
utterances or
stored general utterance patterns to identify the additional general utterance
patterns.
59. The machine-based method of claim 57 in which the interaction
application is being
developed for an enterprise of a particular industry, and at least some of the
general utterance
patterns are available to developers of interaction applications for another
industry.
60. The machine-based method of claim 57 comprising
in response to the developer proposing a sample utterance for the interaction application,
automatically suggesting inclusion in the interaction application of a
particular general utterance
pattern.
61. The machine-based method of claim 57 comprising
identifying additional general utterance patterns for inclusion in the set
based on similar
sample utterances proposed by multiple developers of interaction applications
for enterprises in
the particular industry.
62. The machine-based method of claim 57 comprising
determining an intent of a sample utterance proposed by the developer; and
identifying a stored sample utterance or a stored general utterance pattern
having an
intent that matches the intent of the proposed sample utterance.
63. The machine-based method of claim 62 comprising
automatically suggesting a particular general utterance pattern for inclusion
in the
interaction application in response to determining that the intent of the
proposed sample
utterance does not match an intent of stored sample utterances or stored
general utterance
patterns.
64. The machine-based method of claim 57 comprising
identifying an industry for the interaction application based on a sample
utterance
proposed by the developer; and
identifying the additional general utterance patterns for inclusion in the set
of available
general utterance patterns based on the industry.
65. The machine-based method of claim 57 comprising
identifying an industry for a sample utterance proposed by the developer; and
automatically suggesting inclusion of the sample utterance to one or more
other
developers of interaction applications in the industry.
66. The machine-based method of claim 65 comprising
applying the machine learning techniques to the sample utterance proposed by
the
developer to identify the industry.
67. The machine-based method of claim 65 comprising
training the machine learning techniques for interaction applications in the
industry using
the proposed sample utterance.
68. The machine-based method of claim 57 comprising
forming the interaction application including at least one of the additional
general
utterance patterns.
69. The machine-based method of claim 68 comprising deploying the
interaction application.

Description

Note: Descriptions are shown in the official language in which they were submitted.


DEVELOPMENT OF VOICE AND OTHER INTERACTION APPLICATIONS
This description relates to development of voice and other interaction
applications.
A typical interactive voice application or chat application, for example,
processes spoken or
written requests (or simply "requests") received from an end user through an
assistant platform
or bot platform such as Amazon Alexa or Google Assistant. (We sometimes use
the word
"assistant" in place of "assistant or bot".) The assistant processes each
request to determine the
end user's intent. The interaction application then uses the intent to
generate a response to be
spoken or displayed back to the end user or both. The work of the interaction
application is
implemented using an interaction model, endpoint business logic, and content
used for the
responses.
Interaction Model
The interaction model is an object that helps the assistant platform to
determine the intent of a
request from an end user. Often the interaction model is in the form of JSON
data including
intents, slots, and sample utterances. Sample utterances are text expressions
of utterances that the
interaction model expects to encounter in end user requests. Slots contain
parameter values
associated with requests and responses. Intents are the intentions of end
users that correspond to
their requests.
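As a minimal illustrative sketch (expressed here as a TypeScript object literal; the slot type string is an assumption, and the intent name and sample utterances are borrowed from the weather example that appears later in this description), such JSON data might look like:

    // Fragment of an interaction model: one intent, its sample utterances,
    // and one slot whose value is filled in from the end user's request.
    const interactionModelFragment = {
      intents: [
        {
          name: "CityWeatherIntent",                 // the end user's intention
          samples: [                                 // utterances the model expects
            "what is the weather in {City}",
            "how is it outside in {City}",
          ],
          slots: [{ name: "City", type: "TEXT" }],   // "TEXT" is an illustrative slot type
        },
      ],
    };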
Endpoint business logic
The endpoint of an interaction application is the component that receives
information about end
user intents from the assistant platform and sends text information to the
assistant platform about
items of content to be used in responses. The information about a user request
includes the name
of the intent that a natural language processor of the assistant platform
matched to the utterance of
the request and the values of any slots that were assigned by the assistant
platform in the process
of matching a received utterance from an end user with sample utterances of
the interaction
model. The endpoint business logic generally represents and implements what
the enterprise
wants to provide as responses to received intents. The endpoint business logic
is usually
implemented as a RESTful HTTP API or a server-less function. The main
functions of the
endpoint business logic are to execute processes that use the interaction
model, the intents, and
slot information to find appropriate items of content and execute business
logic to use for
responses to requests.
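A minimal sketch of endpoint business logic of the kind just described, written as a hypothetical serverless-style handler in TypeScript (the request and response shapes and the function name are assumptions for illustration, not any particular assistant platform's API):

    // Assumed shape of what the assistant platform sends to the endpoint.
    interface EndpointRequest {
      intentName: string;                // intent matched by the platform's NLP
      slots: Record<string, string>;     // slot values assigned during matching
    }

    // Assumed shape of what the endpoint returns for the response.
    interface EndpointResponse {
      speechText: string;                // text to be spoken or displayed to the end user
    }

    // Serverless-style handler: map the matched intent to an item of content.
    export async function handleRequest(req: EndpointRequest): Promise<EndpointResponse> {
      switch (req.intentName) {
        case "WelcomeIntent":
          return { speechText: "Welcome to the weather app, ask about the weather." };
        case "CityWeatherIntent":
          // Use the slot value to find the appropriate item of content.
          return { speechText: `Looking up the weather for ${req.slots["City"]}.` };
        default:
          return { speechText: "Sorry, I can't help with that yet." };
      }
    }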
Content for responses
The content for the response that the endpoint business logic returns to the
assistant platform and
that is eventually spoken or displayed to the end user can be in the form of
text derived from a
speech-to-text process or media files or both.
The interaction model helps to identify the meanings of text phrases (spoken
or written) derived
from end user requests and maps the meanings to intents according to the
protocols that govern
one or more assistant platforms. Assistant platforms such as Amazon Alexa and
Google
Assistant, for example, use interaction models to provide abstract
representations for mapping of
spoken or written human words or phrases (which we together sometimes call
simply
"utterances") to specific functions (i.e., intents). An interaction model
(typically in the form of
JSON data) can comprise a hierarchical structure of intents, utterances, and slots.
An intent represents a function that is bound to one or more utterances. An
utterance may contain
one or more slots to represent dynamic values (for example, a time of day).
When an intent is
indicated by interaction of an end user with an interaction assistant (e.g.,
an Amazon Echo Dot),
information about the interaction (including the identified intent) is
delivered by the assistant
platform to the endpoint for additional processing. An endpoint is essentially
an application
having a collection of functions or methods that map to the intents defined
within the interaction
model. The endpoint's functions may contain references to items of content or
literal content (we
sometimes refer to the "items of content" and "literal content" simply as
"content") that becomes
part of the responses sent back to the assistant platform.
An interaction application is expected to implement interactions that are
conversational from the
end user's perspective. The developer's role is to impart to the interaction
application
information to enable it to correctly interpret intents and return appropriate
items of content for
responses to them. Typically a conversational interaction application is
developed using the
components described above and either a custom development process or a flow
designer
process (also known as a skill builder; we sometimes use the word "skill"
interchangeably with
"interaction application" or "app"). Both of these two approaches are based on
literal (one might
say "hard wired") connections between intents and utterances.
The custom development process
In typical development of a custom conversational interaction application or
app, the developer
creates everything from scratch and manually develops the interaction model,
endpoint business
logic, and content. The typical development process includes the following
sequence of steps:
1. Decide the interactions (requests and responses) the app should support.
2. Generate a unique intent for each interaction (request or response) of the
app
that can happen in a conversation with the end user.
3. Manually enter sample utterances expected for each intent. A sample
utterance
can be a word or phrase that the end user speaks or writes (an utterance) to
express an intent. The developer tries to enter a comprehensive set of all of
the
sample utterances that an end user might be expected to say or write to
express a
given intent.
4. Compile all the intents and their corresponding sample utterances into an
interaction model, directly mapping each sample utterance to its exact intent.
5. Create endpoint business logic that can receive from the interaction model
an
intent corresponding to an end user request based on matching the utterances
of
the request to the sample utterances of the intents created and compiled
previously.
6. Provide a process for returning stored content that is the exact match for
the
given intent.
For example, if the developer is building an app to enable an end user to ask
for the weather, the
interaction model structure might be (the word "samples" refers to sample
utterances; words in
brackets are slots that can have specific values depending on the request):
Intent:
    name: "WelcomeIntent",
    samples: ["open weather app", "talk to weather app"]
Intent:
    name: "GeneralWeatherIntent",
    samples: ["what is the weather?", "how is it outside", "how is the weather today?"]
Intent:
    name: "CityWeatherIntent",
    samples: ["what is the weather in {City}?", "how is it outside in {City}", "how is the {City} weather today?"]
Intent:
    name: "LatestNewsIntent",
    samples: ["what is new?", "what is the latest?", "anything new?"]
For such an app, the endpoint business logic could be expressed as:
if(request.name == "WelcomeIntent") return "Welcome to the weather app, ask about the weather."
if(request.name == "GeneralWeatherIntent") return ...
if(request.name == "CityWeatherIntent") return ...
The data flow at run time for the business logic endpoint would then be as
shown in figure 1.
Custom app development can require that:
1. Each app have its own specific interaction model and endpoint business
logic.
2. All parts of the app be manually coded.
3. The interaction model be created manually.
4. The interaction model be redeployed or recertified if a new intent or
sample
utterance is added to the interaction model.
5. A detailed utterance have an exact match to a sample utterance to respond
to a
request.
6. There be a large number of hand entered sample utterances per intent.
7. Slots be specific and contextual.
The flow designer development process
The app that is the end result of the flow design development process is
similar to an app that
results from the custom application development process described above. Among
the
differences are that the flow design process provides tools that help to
automate the creation of
the interaction model and the simple endpoint business logic.
The typical process is:
1. Add nodes to a graphical user interface canvas presented to a developer.
2. Each node represents a specific intent that is added to the interaction
model.
3. For each node an administrative user (e.g., someone who is not technically
trained) can add sample utterances and give the node a name.
4. Add content and attach the content to the node as the response.
5. Connect the nodes on the canvas to extend the endpoint business logic for
continuing a conversation.
6. Store the created node structure in a database.
The resulting interaction model would look just like the one developed using
the custom process.
At run time, the execution of the business logic of the app when an intent is
received is to (sketched in code after this list):
1. Search the database of nodes for the one that matches the intent associated
with
the request.
2. Find the content that is attached to that node and return that content as
the
response to the request.
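A minimal sketch in TypeScript of the node storage and run-time lookup listed above, assuming nodes are kept as simple records keyed by intent name (the record fields and the in-memory store are illustrative assumptions, not the flow designer's actual schema):

    // Assumed record for a node created on the flow designer canvas.
    interface FlowNode {
      name: string;               // name given by the administrative user
      intentName: string;         // intent added to the interaction model for this node
      sampleUtterances: string[];
      responseContent: string;    // content attached to the node as the response
      nextNodeIds: string[];      // connections that continue the conversation
    }

    // In-memory stand-in for the database of nodes.
    const nodesByIntent = new Map<string, FlowNode>();

    // Run time: find the node that matches the intent and return its attached content.
    function respondToIntent(intentName: string): string | undefined {
      return nodesByIntent.get(intentName)?.responseContent;
    }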
Flow design development can require:
1. Recertification and redeployment of the app after each change to the flow
due
to the modification of intents and utterances.
2. Utterances to match sample utterances exactly for a node in the canvas.
3. The entire conversational app to be designed before deployment.
4. Extra effort to manage apps that are designed to handle, say, more than 20
intents.
5. Slots to be specific and contextual.
The flow design development process can potentially support multiple platforms
and doesn't
require custom coding of endpoint business logic.
Summary
In general, in an aspect, a developer of an interaction application for an
enterprise can create
items of content to be provided to an assistant platform for use in responses
to requests of end-
users. The developer can deploy the interaction application using defined
items of content and an
available general interaction model including intents and sample utterances
having slots. The
developer can deploy the interaction application without requiring the
developer to formulate any
of the intents, sample utterances, or slots of the general interaction model.

Implementations may include one or a combination of two or more of the
following features. The
enabling of the developer to create items of content includes presenting a
graphical user interface
enabling the developer to create the items of content without requiring the
developer to engage in
coding. The intents of the general interaction model include abstract intents
that cannot be
mapped directly to specific content without parsing and interpretation of
slots associated with
corresponding sample utterances. The sample utterances include abstract
characterizations of
requests that cannot be mapped directly to specific content without parsing
and interpretation of
slots of the sample utterances. One or more of the slots includes an open-
ended slot that requires
parsing and interpretation in order to determine an appropriate item of
content corresponding to
an utterance of an end user. The interaction application is for an enterprise
that belongs to a
particular vertical market and the developer can select a template configured
for developing
interaction applications for enterprises belonging to the particular vertical
market. The developer
can indicate one or more particular sample utterances of an end user for each
intent. The
developer can customize the general interaction model by adding an intent or a
sample utterance
pattern. The developer can deploy the interaction application for use with two
or more different
assistant platforms without requiring any action by the developer to configure
the interaction
application for use with each of the two or more different assistant
platforms.
In general, in an aspect, a memory stores instructions executable by a
processor to receive
utterances including slots, the utterances having been derived by an assistant
platform from
requests of end-users of interaction assistants. Each of the received
utterances is applied to a
general interaction model to determine intents. The general interaction model
includes non-
specific sample utterances including open-ended slots. The intents are
forwarded to an
interaction application configured to find items of content for use in
providing responses to the
requests of the end users.
Implementations may include one or a combination of two or more of the
following features. The non-specific sample utterances cannot be used directly to find
items of content without parsing and interpreting the open-ended slots. The open-ended
slots include extended portions of utterances of end-users including parameters having
values and text elements representing context to be parsed and interpreted. The intents
include non-specific intents.
In general, in an aspect, markup elements of a speech markup language string
are expressed as
a tree of nodes. Each of the nodes corresponds to one of the markup elements
of the string. The
tree of nodes is segmented into branches. The branches have the same first
node. The branches
are processed. The speech markup language string is re-expressed based on results of the
processing of the branches. The re-expressed speech markup language string is stored
for use in responding to requests of end-users of interaction assistants.
Implementations may include one or a combination of two or more of the
following features. The
speech markup language string is expressed in accordance with SSML. The markup
elements
include tags of a speech markup language. The segmenting of the tree of nodes
into branches
includes identifying branches that may not be usable by an interaction
assistant platform that
applies a version of a speech markup language according to which the speech
markup language
string is expressed. The branches may not be usable because they contain nodes
that are invalid
elements of the version of the speech markup language applied by the
interaction assistant
platform. The invalid elements include invalid types of elements. The invalid
elements include
elements having invalid properties. The invalid elements include elements
having invalid values
of properties. The invalid elements include invalid types of children nodes.
The re-expressing of
the speech markup language string based on results of the processing of the
branches includes
removing invalid nodes of branches and merging the branches including the
branches from
which the invalid nodes have been removed. The re-expressed speech markup
language string is
provided to an interaction assistant platform for use in a text to speech
presentation of a response
to an end user.
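A minimal TypeScript sketch of this tree-and-branch processing, under the assumption of a simplified node type and a caller-supplied predicate that says whether a tag is valid for a given assistant platform (the real platform rules and the serialization step are not shown):

    // Simplified node in the tree of markup elements of a speech markup language string.
    interface MarkupNode {
      tag: string;                 // e.g. "speak", "prosody", "emphasis", or "#text"
      text?: string;
      children: MarkupNode[];
    }

    // Segment the tree into branches that all share the same first (root) node.
    function segmentIntoBranches(root: MarkupNode): MarkupNode[] {
      return root.children.map((child) => ({ ...root, children: [child] }));
    }

    // Process a branch: drop nodes the platform cannot use, keeping their children.
    function removeInvalidNodes(node: MarkupNode, isValid: (tag: string) => boolean): MarkupNode[] {
      const children = node.children.flatMap((c) => removeInvalidNodes(c, isValid));
      return isValid(node.tag) ? [{ ...node, children }] : children;
    }

    // Merge the processed branches back into one tree for re-expression as a string.
    function mergeBranches(branches: MarkupNode[], rootTag: string): MarkupNode {
      return { tag: rootTag, children: branches.flatMap((b) => b.children) };
    }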
In general, in an aspect, rules are stored representing a particular version
of a speech markup
language definition applied by an interaction assistant platform. The rules
representing the
particular version are applied to validate a speech markup language string to
be used in responses
to requests of end-users of interaction assistants conforming to the
particular version.
Implementations may include one or a combination of two or more of the
following features.
Rules are stored representing a second particular version of the speech markup
language
definition applied by a second interaction assistant platform. The rules
representing the second
particular version are applied to validate a speech markup language string to
be used in responses
to requests of end-users of interaction assistants conforming to the second
particular version. The
speech markup language definition includes SSML. The validated speech markup
language
string is provided to the interaction assistant platform for use in presenting
responses to requests
of end-users. The application of the rules to validate the speech markup
language string includes
expressing markup elements of a speech markup language string as a tree of
nodes. The
application of the rules to validate the speech markup language string
includes identifying
elements of the string having invalid properties. The application of the rules
to validate the
speech markup language string includes identifying elements of the string
having invalid values
of properties. The application of the rules to validate the speech markup
language string includes
identifying elements of the string having invalid types of children nodes.
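A minimal TypeScript sketch of storing such rules per platform version and applying them to a tree of markup elements (the rule fields and the notion of a per-tag property whitelist are assumptions for illustration):

    // Assumed rules describing one platform's version of the speech markup language.
    interface MarkupRules {
      platformVersion: string;
      allowedTags: Set<string>;
      allowedProperties: Map<string, Set<string>>;   // tag -> permitted property names
    }

    interface MarkupElement {
      tag: string;
      properties: Record<string, string>;
      children: MarkupElement[];
    }

    // Validate an element tree against the stored rules for one platform version.
    function validate(element: MarkupElement, rules: MarkupRules): boolean {
      if (!rules.allowedTags.has(element.tag)) return false;
      const allowed = rules.allowedProperties.get(element.tag) ?? new Set<string>();
      const propertiesOk = Object.keys(element.properties).every((p) => allowed.has(p));
      return propertiesOk && element.children.every((child) => validate(child, rules));
    }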
In general, in an aspect, a user interface is presented enabling a developer
to create speech
markup language strings conforming to a speech markup language definition
applied by a
corresponding interaction assistant platform. The user interface enables the
user to create markup
language strings using plain text and graphical elements and without requiring
the user to select
or enter any formal expressions of markup elements of the speech markup
language definition.
Implementations may include one or a combination of two or more of the
following features. The
user interface presents controls for entering text to be spoken to an end user
by an interaction
assistant. The user interface presents controls corresponding to elements of
the speech markup
language strings associated with effects to be applied or added to one or more
words of text to be
spoken to an end user by an interaction assistant. The user interface presents
controls
corresponding to properties of elements of the speech markup language strings.
The user
interface presents controls corresponding to selectable values of properties
of elements of the
speech markup language strings. The user interface presents controls including
icons graphically
representative of effects to be applied to one or more words of text to be
spoken to an end user
by an interaction assistant, properties of the effects, or values of
properties of the effects. The
user interface displays graphical indicators in line with text words, the
graphical indicators
representing effects to be applied to one or more of the text words when the
words are spoken to
an end user by an interaction assistant. The graphical indicators include
graphical icons
indicative of the nature of the effects. The graphical indicators include
graphical elements
identifying values of properties of effects. The graphical indicators include
backgrounds
displayed with the words to which the corresponding effects are to be applied.
The backgrounds
are color-coded according to the effects to which they correspond. The effects
to be applied to
one or more words can be nested and the backgrounds are nested in accordance
with the nesting
of the effects. The user interface displays controls enabling a developer to
select a display of the
raw speech markup language strings or a display of the text and graphical
indicators
representative of the effects to be applied to the text in line.
In general, in an aspect, a representation of the speech markup language
string is displayed to a
user of a user interface, including plain text and graphical indicators of
markup language
elements. Each of the graphical indicators is displayed in graphical
association with a portion
of the speech markup language string to which the corresponding one of the
markup language
elements applies.
Implementations may include one or a combination of two or more of the
following features. The
graphical indicators are displayed in line with the plain text. The graphical
indicators represent
audio effects to be applied to one or more words of the plain text. The
graphical indicators
represent properties of audio effects to be applied to one or more words of
the plain text. The
graphical indicators include backgrounds displayed with words of the plain
text to which the
corresponding effects are to be applied. The backgrounds are color-coded
according to the
effects to which they correspond. The effects to be applied to one or more
words of the plain text
can be nested and the backgrounds are nested in accordance with the nesting of
the effects. The
user interface displays controls enabling a developer to display the raw
speech markup
language strings and to display the plain text and graphical indicators
representative of the
effects to be applied to the text in line.
In general, in an aspect, content is stored that is configured to be used by
two different
interaction applications in generating responses to requests from users of
interaction assistants.
The two different interaction applications are executed to respond to intents
and slot information
received from assistant platforms based on the requests from users of
interaction assistants. The
intents and slot information have been generated by application of a single
general interaction
model to the requests from the users.
Implementations may include one or a combination of two or more of the
following features. The
execution of the two different interaction applications invokes the respective
stored content. The
two different interaction applications are associated with a single
enterprise. The two different
interaction applications are associated with two different enterprises
belonging to a single
vertical market. The two different interaction applications are associated
with two different
enterprises belonging to two different vertical markets. At least one of the
different interaction
applications is executed to respond to intents and slot information received
from two different
assistant platforms. The single general interaction model includes non-
specific intents. The
single general interaction model includes open-ended slots. The stored content
is updated
without changing the single general interaction model.
In general, in an aspect, intents and slot information are received from an
assistant platform
based on requests of end-users of interaction assistants. The intents and slot
information have
been derived by natural language processing. Additional natural language
processing is applied
to the intents and slot information received from the assistant platform.
Additional information
about the requests of the end users is derived based on the additional natural
language
processing.
In general, in an aspect, utterances are received representing requests of end
users of interaction
assistants. The received utterances are compared with non-specific sample
utterances of a
general interaction model. A variety of received utterances are accepted as
matches for a given
non-specific sample utterance.
Implementations may include one or a combination of two or more of the
following features. The
slot information of the received utterances is processed to identify content
to be used in
responses to the requests. Each of the different items of content corresponds
only to one of the
received utterances of the variety of received utterances that are accepted as
matches.
In general, in an aspect, a user interface enables a developer of an
interaction application to
select general utterance patterns for inclusion in the interaction
application. Each of the general
utterance patterns spans a set of one or more sample utterances that
correspond to the general
utterance pattern. The user interface exposes a set of available general
utterance patterns.
Machine learning techniques are automatically applied to stored sample
utterances, stored
general utterance patterns, or sample utterances proposed by developers of
interaction
applications, to identify additional general utterance patterns. The
additional general utterance patterns are included in the set of available general
utterance patterns exposed by the user interface.
Implementations may include one or a combination of two or more of the
following features.
Proposed sample utterances of developers are matched with stored sample
utterances or stored
general utterance patterns to identify the additional general utterance
patterns. The interaction
application is being developed for an enterprise of a particular industry, and
at least some of the
general utterance patterns are available to developers of interaction
applications for another
industry. In response to the developer proposing a sample utterance for the interaction
application, a suggestion is automatically made to include a particular general utterance
pattern in the interaction
application. Additional general utterance patterns are identified for
inclusion in the set based on

similar sample utterances proposed by multiple developers of interaction
applications for
enterprises in the particular industry.
These and other aspects, features, implementations, and advantages (a) can be
expressed as
methods, apparatus, systems, components, program products, business methods,
means or steps
for performing functions, and in other ways, and (b) will become apparent from
the following
description and from the claims.
Description
Figure 1 is a flow diagram.
Figure 2 is a block diagram.
Figure 3 presents flow information.
Figures 4 through 8 are trees.
Figures 9 through 13 are screenshots.
Figures 14 through 16 are schematic diagrams.
Here we describe an improved interaction application development platform and
process, which
we sometimes call simply the "development platform". The development platform
has a variety
of features that make development of interaction applications fast, easy,
adaptable, scalable, and
convenient, among other advantages.
Content-first
One feature of the development platform is its use of a "content-first" (or
content-centric)
development approach. The content-first development approach gives priority to
the aspects of
the app development and deployment process that involve development of content
and
management of relationships between end-user requests and responses.
General interaction model
Another aspect of the development platform is that, instead of requiring a
developer or
administrator to manually create an entire interaction model (directly or
indirectly), the
development platform provides a pre-populated general interaction model that
can handle almost
any end user request without input from the developer or administrator. As
described later, the
development platform can be customized by the user to handle any unusual
requests. The general
interaction model is abstract and globally applicable to a wide variety of
existing and future
assistant platforms, a wide variety of enterprises within a given vertical
market, and in a wide
variety of vertical markets.
As an example, the following hard-coded interaction model can support only two
user requests:
Welcome and Weather.
Intent: {
    name: "WelcomeIntent",
    samples: ["open weather app", "talk to weather app"]
}
Intent: {
    name: "GeneralWeatherIntent",
    samples: ["what is the weather?", "how is it outside", "how is the weather today?"]
}
The development platform's general interaction model, by contrast, can manage
Welcome,
Weather, and several other user requests due to its abstract nature.
Intent: {
    name: "VoicifyGeneralQuestionIntent",
    samples: ["what is the {Query}?", "how is {Query}"]
}
To demonstrate, the abstract utterance pattern of "what is the {Query}" can
handle user requests
that follow the abstract utterance pattern where the {Query} value can be
dynamically
determined.
Graph traversal and content index searching
Another aspect of the development platform is that the way the general
interaction model
determines where to search for content for use in a response to a request is
based on graph
traversal followed by content index searching. Certain features of such graph
traversal and
content index searching are described in more detail in United States patent
applications
16/000,805, 16/000,799, 16/000,789, and 16/000,798, all filed on June 5, 2018, and in issued
United States patent 10,235,999, which are incorporated here by reference in
their entirety.
Question and answer development example
It is common for interaction applications to define a collection of questions
and answers to reply
to end-user requests (questions) by appropriate responses (answers). It is
like a collection of
frequently asked questions (i.e., FAQ's) within a website only handled by
voiced answers to
voiced questions. In typical cases for which the requests are expected to be
questions and the
responses will be answers to the questions, the basic process of creating a
specific interaction
model for an app using the development platform is simple and includes three
steps:
1. Invoke a template type appropriate for the specific interaction model. For
example, the template for a question and answer represents an object that
consists
of a collection of sample utterance phrases corresponding to the question and
a
content response corresponding to the answer.
2. Enter and store items of content for the template type. Using the example
above, a user would enter content that represents the response (answer) to the
question.
3. Enter and store a few ways someone can ask a question (sample utterances).
Using the entered content and questions and information contained in the
template, the
development platform has enough information to automatically process and
generate a response
to essentially any type of request an end user might pose and handle
variations of utterances that
don't require exact matching. For example, end-user requests that use the
general utterance
pattern "how do I {Query}?" will map to a single intent within the development
platform's
general interaction model. The development platform uses the value of {Query}
to search for a
content match that will provide a suitable answer to both the general "how do
I" part of the
request and the specific {Query} part of the request. Because {Query} can have
a wide range of
specific values representing a variety of implicit intents, the use of the
general utterance pattern
supports a wide range of requests. In some implementations, this simple process
is possible
because the general interaction model of the development platform includes
relatively fewer
different intents (utterance patterns) than would be used for the specific
interaction model in
custom development or flow designer development, and the general interaction
model uses open-
ended slots as explained below. Said another way, the general utterance
pattern represents a
range of possible specific intents all falling within the notion of a
generalized intent, and the
{Query} value can be used in the course of the processing of the interaction
model to
disambiguate exactly which specific intent within the generalized intent was
meant by the end
user's request.
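A minimal TypeScript sketch of a question-and-answer content item of the kind created in the three steps above, and of matching a {Query} value against the stored items (the field names and the word-overlap scoring are assumptions, not the platform's actual content index):

    // Assumed content item entered through the question-and-answer template.
    interface QuestionAnswerItem {
      questionPhrases: string[];   // a few ways someone can ask the question
      answerContent: string;       // content that represents the response (answer)
    }

    // Rough content match: pick the item whose question phrases share the most
    // words with the {Query} value extracted from the general utterance pattern.
    function findAnswer(query: string, items: QuestionAnswerItem[]): string | undefined {
      const queryWords = new Set(query.toLowerCase().split(/\s+/));
      let best: { item: QuestionAnswerItem; score: number } | undefined;
      for (const item of items) {
        for (const phrase of item.questionPhrases) {
          const score = phrase.toLowerCase().split(/\s+/).filter((w) => queryWords.has(w)).length;
          if (!best || score > best.score) best = { item, score };
        }
      }
      return best && best.score > 0 ? best.item.answerContent : undefined;
    }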
Open-ended slots
The general interaction model uses open-ended slots that can be fulfilled by
full sentences rather
than small phrases or individual words. For example, a sample utterance and
its open-ended slot
might be represented as: "I want to {Query}" in which the word Query
represents the open-
ended slot. This generalized sample utterance yields an intent match for a
variety of requests including but
not limited to "I want to buy a computer", "I want to learn about fishing", "I
want to know what
the weather is". The requests represent vastly different intents of the end
user but are represented
by a single sample utterance pattern.
Slots that are more open-ended are possible because, at run time, the
interaction application can
use its data flow including graph traversal and content search to match each
request to the proper
content to be used in the response. And because the values for the slots that
are in the request
contain full sentences and phrases, the interaction application can do
additional secondary
natural language processing, such as keyword extraction and variable
extraction. (for example,
the interaction application will search for the phrase "buy a computer" based
on the request "I
want to buy a computer") even after the assistant platform has done its
primary natural language
processing on the request before the request is received by the endpoint from
the assistant
platform.
Because the general interaction model is simplified using fewer, but pattern-
based open-ended
(e.g., abstract or general) intents and sample utterances, the development
platform can use the
same general interaction model for many different interaction applications
(being developed, for
example, across competitors in a vertical market and across different vertical
markets). Each of
the interaction applications can also include enhancements to the general
interaction model if
necessary without affecting the underlying generic interaction model.
We use the term "open-ended" (with reference to, e.g., slots, intents, and
utterance patterns)
broadly in the sense, for example, that the element to which it refers is abstract,
generalized, spans potentially multiple varied instances, is universal, or is otherwise generic.
Example general interaction model
An example (very simplified) general interaction model for an app (interaction
application)
developed using the development platform is:
"interactionModel":
"languageModel":
"invocationName": "voicify labs",
"intents": [
"name": "AMAZON.FallbackIntent",
"samples": []
"name": "AMAZON.CancelIntent",
"samples": [
"I'm all set"
"name": "AMAZON.HelpIntent",
"samples": []
"name": "AMAZON.StopIntent",
"samples": [
"Quit",
"Goodbye"

CA 03151910 2022-02-18
WO 2021/034613
PCT/US2020/046201
"name": "VoicifyGeneralQuestionIntent",
"slots": [
"name": "Query",
"type": "AMAZON.SearchQuery"
],
"samples": [
"Show me {Query}",
"Do you have {Query}",
"Give me a {Query}",
"Give me an {Query}",
"Tell me {Query}",
"Are there {Query}",
"Do I {Query}",
"How does {Query}",
"Where did {Query}",
"What were {Query}",
"Help me {Query}",
"Is there {Query}",
"Where's {Query}",
"Where is {Query}",
"For a {Query}",
"Can I {Query}",
"I {Query}",
"I am {Query}",
"I would {Query}",
"I want {Query}",
"How can I {Query}",
"Who are {Query}",
"What are {Query}",
16

CA 03151910 2022-02-18
WO 2021/034613
PCT/US2020/046201
"Tell me about {Query}",
"What is {Query} ",
"What's {Query}",
"How did I {Query}",
"How to {Query}",
"How should {Query}",
"What does {Query} ",
"What's on {Query} ",
"What is on {Query}",
"Are there any tweets {Query}",
"Did anyone tweet {Query}",
"Give me the {Query} ",
"Create a {Query
The VoicifyGeneralQuestionIntent can be illustrated by the following sample:
Intent: {
    name: "VoicifyGeneralQuestionIntent",
    samples: ["What is {Query}", "How does {Query}", ...]
}
Given this portion of the general interaction model, the end user's utterance
of "What is the
weather like in Boston today" when applied to the general interaction model
would match the
abstract first sample utterance ("What is {Query}") for the intent and would send to
the endpoint the
"general question intent" and include the phrase "the weather like in Boston
today" in the
{Query} slot. The intent is abstract and general in that it only entails an
indication of an
extremely broad class of question, such as What? or How? The general question
intent is not
specific in any other respect.
By contrast, for a specific interaction model generated in a custom app
development process or
using a flow designer development process, the endpoint would only receive the
specific value
"Boston" in the {City} slot of the request.
Naturally, in the context of a conversational interaction application
involving questions and
answers, the "general question intent", a well-phrased comprehensive set of
sample utterances,
and an appropriate graph of corresponding nodes to be traversed, should enable
the endpoint
business logic to handle essentially any incoming question initiated by an end
user. The lengthy
query slots of the sample utterances provide information enabling the endpoint
to traverse the
graph and find the content that is appropriate for any of a variety of
possible slot values.
Because the endpoint of the development platform can receive more of the full
phrase ("the
weather like in Boston today" versus "Boston"), the endpoint can apply
processes to the fuller
(and potentially more complicated or nuanced) phrase enabling it to understand
the request more
completely and effectively than if it received only an intent name and a city
slot. These processes
may include additional natural language understanding, key word extraction,
sentiment analysis,
content search, and analytics processing. These types of additional processing
generally are not
possible without the availability of the longer phrase or expression.
The endpoint business logic for reaching the right content for a given request
then follows the
data flow discussed in the previously cited patent applications and patent and
as shown in figure
2. This process involves the native assistant platform sending the request
data to the endpoint of
the interaction application which then goes through the following steps to
determine the response
to return:
1. Validation of the request received.
2. Graph traversal to determine the expected content area to search.
3. Content search against indexed content created by the content management
system user.
4. Additional processing of the request and response such as 3rd party
webhook requests and
analytics tracking.
5. Building the response to return to the native assistant platform.
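A compact Python sketch of these five steps follows; the function, the toy graph, and the content index are hypothetical stand-ins used only to make the flow concrete, not the development platform's actual implementation.

def handle_request(request, graph, content_index):
    # 1. Validation of the request received.
    if "intent" not in request or "slots" not in request:
        return {"speech": "Sorry, I did not understand that."}
    # 2. Graph traversal to determine the expected content area to search.
    content_area = graph.get(request["intent"], "general")
    # 3. Content search against indexed content created in the content management system.
    query = request["slots"].get("Query", "").lower()
    hits = [item for item in content_index.get(content_area, [])
            if query and query in item["keywords"]]
    chosen = hits[0] if hits else {"text": "No matching content was found."}
    # 4. Additional processing such as third-party webhook requests and analytics tracking.
    print("analytics:", request["intent"], query)  # stand-in for real tracking
    # 5. Building the response to return to the native assistant platform.
    return {"speech": chosen["text"]}

graph = {"VoicifyGeneralQuestionIntent": "faq"}
content_index = {"faq": [{"keywords": "store hours opening", "text": "We open at 9 am."}]}
request = {"intent": "VoicifyGeneralQuestionIntent", "slots": {"Query": "store hours"}}
print(handle_request(request, graph, content_index))  # {'speech': 'We open at 9 am.'}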
General Interaction Model Editing
When an interaction application is created on the interaction platform, it is
given a base
interaction model with several intents formatted by the host of the
interaction platform. Many
applications will make use of this interaction model as is, and never need to
update it.
While many developers will not necessarily need to update or enhance the
development
platform's general interaction model because its intents and sample utterances
are abstract and it
has broad applicability, more advanced applications may require additional
control over the
general interaction model. For these cases, the interaction platform enables
users to change the
development platform interaction model directly and allows the platform to
translate changes to
a specific interaction model automatically during deployment. This process is
described in figure
3.
These updates and changes are not applied to change the base interaction model
directly. Instead,
updates and changes to the base interaction model are stored as sets of
discrete changes. Each set
of changes is timestamped to preserve the history and chronology of the
changes.
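A minimal sketch of this storage scheme, with assumed field names, could look like the following: the base model is never mutated, and each discrete change set carries its own timestamp.

from datetime import datetime, timezone

base_model = {"intents": {"Intent1": ["sample one"], "Intent2": ["sample two"]}}
change_sets = []  # discrete, timestamped changes; the base model stays untouched

def record_change(operation, intent, samples):
    change_sets.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operation": operation,   # e.g. "add_intent", "add_samples", "remove_samples"
        "intent": intent,
        "samples": samples,
    })

record_change("add_samples", "Intent1", ["another way to say it"])
record_change("add_intent", "NewIntent", ["brand new phrase {Query}"])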
As shown in figure 3, in a simple example, a base general interaction model
302 provided by the
interaction platform can handle Intent 1 (304) and Intent 2 (306). As
discussed earlier, each
intent comprises a number of sample utterances that an end-user might say to
indicate an intent
to trigger a feature of an interaction assistant and will have between zero
and many slots which
allow specific data values to be extracted from an utterance.
A developer can enhance the base general interaction model by defining an
update to an existing
intent, such as the Intent 1 Update 308. Such an update could include editing
an intent by adding
or removing sample utterances. In some cases the host of the platform can
update an intent of the
base general interaction model such as the Intent 2 Update (310). In some
instances, a developer
can add a New Intent 312 to the base general interaction model.
In addition, the interaction platform can help the developer identify changes
to the base general
interaction model that are rejected by a particular assistant platform. The
interaction platform
tracks successes 314 and failures of deployments and so can trace issues more
specifically to
particular changes or updates instead of having to rely on information for
successes and failures
of a single entire interaction model. In the diagram, Update 1 did not prevent
a successful
deployment, but Update 2 caused the deployment to fail.
The interaction platform provides information to developers and enterprises
about the history of
changes and updates to particular interaction models. This information offers a
number of
opportunities for improving management of interaction models and their
development and
deployment.
For example, a developer may apply updates and find that the performance of
the application has
been affected negatively. Because the interaction platform maintains a history
of the structure
and elements of the interaction model at each successive deployment, the
developer easily can
undo the most recent interaction model update and redeploy the previously
deployed version.
When an interaction model is to be deployed to a particular assistant
platform, it is first flattened
into a deployment interaction model by applying all of the previously defined
changes and
updates. From there, it is converted to specific interaction models 320, 322, 324 that conform to
the requirements of respective assistant platforms and deployed.
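The sketch below reuses the assumed change-set shape from the earlier sketch: it flattens the base model and its change sets into a deployment model and then converts that model into a structure loosely resembling the Alexa interaction model shown earlier. None of this is the platform's actual code; the shapes are illustrative assumptions.

def flatten(base_model, change_sets):
    model = {"intents": {name: list(samples) for name, samples in base_model["intents"].items()}}
    for change in sorted(change_sets, key=lambda c: c["timestamp"]):
        if change["operation"] == "add_intent":
            model["intents"][change["intent"]] = list(change["samples"])
        elif change["operation"] == "add_samples":
            model["intents"].setdefault(change["intent"], []).extend(change["samples"])
    return model

def to_alexa_model(model, invocation_name):
    # Loosely mirrors the "interactionModel"/"languageModel" shape shown earlier.
    return {"interactionModel": {"languageModel": {
        "invocationName": invocation_name,
        "intents": [{"name": name, "samples": samples}
                    for name, samples in model["intents"].items()],
    }}}

deployment_model = flatten(
    {"intents": {"Intent1": ["sample one"]}},
    [{"timestamp": "2020-01-01T00:00:00", "operation": "add_samples",
      "intent": "Intent1", "samples": ["another phrasing"]}],
)
print(to_alexa_model(deployment_model, "voicify labs"))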
In some implementations, the interaction platform may determine (for example,
through
automated inspection of repeated developer updates) that particular intents
are worth updating
for all interaction models for all interaction applications. In these cases,
administrative updates
can be made automatically (or with human assistance) across all interaction
models to add,
remove, or edit one or more intents.
SSML (Speech Synthesis Markup Language) Processing and Managing
Assistant platforms such as Amazon Alexa and Google Assistant can respond to
end-user
commands or statements (i.e., requests) by presenting audible readouts of
text. The audible
readouts are audio files generated by the assistant platform based on text
provided by the
interaction application developer in the content items of the interaction
application. The audible
readouts (generated audio files) leverage computer generated voices hosted by
the assistant
platform that are designed to sound like a human. While the voices are meant
to sound human-
like, the voice assistant typically recites the text provided in the response
from the endpoint at a
consistent pace, exhibiting little intonation or varied emphasis on words.
To provide more human-like qualities to the readout of text, assistant
platforms support a mark-
up language called Speech Synthesis Markup Language (SSML). SSML allows an
interaction
application developer to specify effects to be applied to text that will be
read out by the assistant
platform. At its core, SSML is a programming markup language specification
based on XML
with implied node types that represent "plain-text". The markup language is
used to tell a speech
synthesis engine (hosted by the assistant platforms) how to create an output
audio file from the
text provided in the response from the endpoint. The SSML file is used to
adjust elements of the
speech such as:
Pronunciations of words

Effects to be applied to words or phrases, including
Speed/Rate of speech
Pitch
Volume
Emphasis
Pauses to be added
Mixing of audible speech with recorded audio clips
The full official W3C specification of SSML is available here: https://www.w3.org/TR/speech-synthesis11/.
There are some inherent challenges with SSML when applied to assistant
platforms. Some of the
challenges are based on loose adoption of SSML standards by assistant
platforms. As an
example, while Amazon Alexa and Google Assistant both support SSML, they do
not support all
SSML tags consistently, and in some cases, some tags are not supported at all.
Additionally, since SSML is an XML based programming language, it is applied
using a
hierarchical representation of tags. It borrows techniques similar to those
used by HTML (Hyper-
Text Markup Language) for screen-based output of web pages, but the mark-up of
SSML is
applied to audible output. While it is fairly easy to provide a graphical
interface to enable an
interaction application developer to apply SSML to text, it is challenging to
create a graphical
interface (for example, one suitable for non-technical users) that visually
and intuitively
represents how SSML tags will affect audible output.
The development platform that we describe here offers an effective way to
manage the
challenges referenced above. Among the features of the development platform
are the following:
1. SSML is segmented and parsed into distinct parts for additional processing
to support
functionality such as text-to-speech.
2. SSML is validated using customizable rules and detailed errors. The results
of the validation
offer insight into compatibility across multiple assistant platforms (e.g.,
Google Assistant and
Amazon Alexa).
3. SSML mark-up can be visually (graphically) edited without needing to know
the structure,
hierarchy, code, or rules about it.
SSML Processing
All three of these features and others are made possible by processing an SSML
string into an
object tree structure, applying custom logic to the tree structure, and
processing the object tree
structure back into an SSML string.
In the first step, the processor pre-processes the SSML string into formatted
parts that can be
parsed, then scans the segmented SSML string for XML nodes and plain text and
forms them
into a tree having many parent-child relationships. Each node in the tree has
properties like the
SSML element name, the attributes of the SSML element, and a reference to all
its children.
For example, the following SSML string would be preprocessed and then turned
into the
subsequent tree structure represented in code.
Raw SSML string:
"<speak>This is my plain text <emphasis level=\"strone>with some emphasis
here</emphasis>. And an audio clip here <audio src=\"httus://a-url.com/an-
audio-
file.mp3\"/><prosody speed=\"+50%\">with some effects and say-as <say-as
interpret-
as=\"digits\">123</say-as></speak>"
The preprocessing step then produces the following formatted, pre-processed
SSML:
<speak>
This is my plain text <emphasis level="strong">with some emphasis
here</emphasis>.
And an audio clip here <audio src="https://a-url.com/an-audio-file.mp3"/>
<prosody rate="+50%">
with some effects and say-as
<say-as interpret-as="digits">123</say-as>
</prosody>
</speak>
In the next step, the preprocessed and formatted SSML is parsed to produce the
Processed SSML
Data Structure shown in figure 4.
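As a rough illustration (not the platform's actual parser), an SSML string can be turned into such a tree with Python's standard XML parser; the dictionary-based node shape below (element name, attributes, children, with "plain-text" pseudo-nodes) is an assumption meant to mirror the structure described above.

import xml.etree.ElementTree as ET

def to_node(element):
    # Each node keeps the SSML element name, its attributes, and its children;
    # text runs become "plain-text" pseudo-nodes so ordering is preserved.
    node = {"name": element.tag, "attributes": dict(element.attrib), "children": []}
    if element.text and element.text.strip():
        node["children"].append({"name": "plain-text", "text": element.text.strip(),
                                 "attributes": {}, "children": []})
    for child in element:
        node["children"].append(to_node(child))
        if child.tail and child.tail.strip():
            node["children"].append({"name": "plain-text", "text": child.tail.strip(),
                                     "attributes": {}, "children": []})
    return node

ssml = ('<speak>This is my plain text <emphasis level="strong">with some emphasis '
        'here</emphasis>. And an audio clip here '
        '<audio src="https://a-url.com/an-audio-file.mp3"/>'
        '<prosody rate="+50%">with some effects and say-as '
        '<say-as interpret-as="digits">123</say-as></prosody></speak>')
tree = to_node(ET.fromstring(ssml))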
The structured tree of figure 4 can be processed to recover the original SSML
string by
generating the proper XML elements and attributes based on the name and
properties of the
node as well as the following children of each node. Because recovering the
original SSML
string is straightforward, nodes can be added to the data structure and then
be represented in the
recovered SSML string that can be processed by the native assistant platforms
or text-to-speech
services. In other words, manipulations can be performed when the SSML string
is expressed in
the data structure and then returned to an SSML format typically expected by
the assistant
platforms or text-to-speech services.
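Continuing with the same assumed node shape, recovering an SSML string from the tree is a matter of emitting each node's element, attributes, and children in order, roughly as sketched below.

def to_ssml(node):
    # Plain-text pseudo-nodes contribute their text; real elements are re-wrapped
    # in tags built from the stored name and attributes.
    if node["name"] == "plain-text":
        return node.get("text", "")
    attributes = "".join(f' {key}="{value}"' for key, value in node["attributes"].items())
    if not node["children"]:
        return f"<{node['name']}{attributes}/>"
    inner = " ".join(to_ssml(child) for child in node["children"])
    return f"<{node['name']}{attributes}>{inner}</{node['name']}>"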
Validation and Rule Engine
Using this tree structure, the development platform is able to validate the
underlying SSML
against a set of rules. Among other things, the rules can be customized to fit
differences between
how different platforms support different elements of SSML. For example, Alexa
supports the
<voice/> element while Google does not, and Google has a different min and max
speed value
for the <prosody/> element than does Alexa.
A set of rules or "rule book" can be generated for each assistant platform to
which SSML strings
of the development platform will be provided. A set of rules may have the
following:
A list of SSML elements supported by that assistant platform
A subset of rules for each element
Allowed properties of the element
Allowed values of those properties
Min/Max values
Exact values
Allowed units for those property values
Allowed child element types
A maximum number of elements in the complete string
The validation process traverses the tree beginning at the first node(s). The
validation process
validates each node by:
Checking that the element type is in a list of supported element types
If it is not, the validation process will return an error stating that the
specific element
is not valid
Check each of the properties of the node against the allowed properties for
that type of
element
If there is a property that is not allowed, the validation process will return
an error
stating the property that is not allowed on the specific element
Check the values of each of the properties of the node against the allowed
values for that
property
If there is a value that does not fit the min/max or exact allowed values, the
validation
process will return an error stating the exact element, exact property, and
the actual
allowed values and why the given value did not fit
Check the unit of each property value against the allowed unit types of that
property
If there is a given unit as a property value that is not valid, the
validation process
will return an error stating the given unit, property, and element that does
not allow it
Check that the node's immediate children are among the child types allowed for the node
If there are any children nodes that are not in the allowed child types, the
validation
process will return an error with the name of the child type that is not
allowed for the
specific node type.
Check each of the node's children against the same set of logic above until
there are no
elements of the tree left to check and all of the checked elements comply with
the rules, at
which point the tree is considered valid.
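A condensed Python sketch of such a rule book and validator is shown below; the rule shapes, supported elements, and error wording are assumptions chosen to match the checks listed above rather than the platform's actual rule engine.

RULE_BOOK = {  # one assumed rule book for one assistant platform
    "speak":    {"properties": {}, "children": {"plain-text", "emphasis", "prosody",
                                                "break", "audio", "say-as"}},
    "prosody":  {"properties": {"rate": {"unit": "%", "min": -50, "max": 200},
                                "pitch": {}, "volume": {}},
                 "children": {"plain-text", "say-as", "break", "audio"}},
    "emphasis": {"properties": {"level": {}}, "children": {"plain-text"}},
    "say-as":   {"properties": {"interpret-as": {}}, "children": {"plain-text"}},
    "break":    {"properties": {"time": {}}, "children": set()},
    "audio":    {"properties": {"src": {}}, "children": set()},
    "plain-text": {"properties": {}, "children": set()},
}

def validate(node, errors=None):
    errors = [] if errors is None else errors
    rules = RULE_BOOK.get(node["name"])
    if rules is None:
        errors.append(f'{node["name"]} is not a supported SSML element')
        return errors
    for prop, value in node["attributes"].items():
        allowed = rules["properties"].get(prop)
        if allowed is None:
            errors.append(f'{prop} is not a supported property of the {node["name"]} type')
        elif "min" in allowed:
            try:
                number = float(value.rstrip("%"))
            except ValueError:
                errors.append(f'{value} does not use the allowed unit % for {prop} of {node["name"]}')
            else:
                if not allowed["min"] <= number <= allowed["max"]:
                    errors.append(f'{value} is outside the allowed range for {prop} of {node["name"]}')
    for child in node["children"]:
        if child["name"] not in rules["children"]:
            errors.append(f'{node["name"]} does not allow the {child["name"]} element')
        validate(child, errors)
    return errors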
Examples
Valid types:
A given rule book has the allowed elements of: speak, say-as, prosody, break,
audio
The provided SSML string is: <speak>this is text <yell>this is more
text</yell></speak>
The validation process will return an error saying: "yell is not a supported
SSML
element".
Valid properties:
A given rule book has the allowed type of: prosody
Which has the allowed properties of: rate, pitch, volume
The provided SSML string is: <speak>this is text <prosody
emphasis="loud">this is more text</prosody></speak>
The validation process will return an error saying: "emphasis is not a
supported
property of the prosody type"
Valid property values:
A given rule book has the allowed type of: prosody
Which has the allowed property of: rate
With the allowed values of: >-50% and <+200%
The provided SSML string is <speak><prosody rate="-80%">this is
slow</prosody></speak>
The validation will return an error saying: "-80% is below the minimum value
of -
50% for the rate property of prosody"
Valid property units
A given rule book has the allowed type of: prosody
Which has the allowed property of: rate
With the allowed unit of: %
The provided SSML string is <speak><prosody rate="+100dB">this is
loud</prosody></speak>
The validation will return an error saying: "dB is not a valid unit for the
rate
property of prosody, the only allowed unit is %"
Valid child types:
A given rule book has the allowed type of: say-as
Which has the allowed child types of: plain-text
The provided SSML string is <speak><say-as interpret-as="address">33 Arch
Street, <emphasis level="strong">Boston</emphasis>, MA</say-as
></speak>
The validation will return an error saying: "say-as does not allow the
emphasis
element"
SSML Segmentation
Because certain voice assistants and text-to-speech tools support different
SSML elements, and
sometimes different properties for corresponding supported elements, the
development platform
can adjust incompatible SSML for one assistant platform so that it is
supported, by segmenting
disallowed parts of the tree.
The segmentation process involves selecting an SSML element type that is not
allowed and
removing it from the tree without disturbing that element's children. It is
important not to affect
the children in order to maintain any other allowed "parent" effects that are
applied to the
children. The segmentation process leaves the original tree broken into
multiple trees depending
on the number of places it needs to perform segmentation.

For example, Alexa supports the <voice> tag that Google does not support. So,
if desired, the
development platform can segment SSML elements that use the <voice> tag for
compatibility with
Google and remove those elements while keeping other effects.
Consider the following SSML string:
<speak>
This is text
<prosody volume="+2dB">
Loud text
<voice name="Brian">
This is text too
</voice>
</prosody>
</speak>
which has the tree representation shown in figure 5.
The development platform would segment based on the <voice> element and create
two trees
(separated segments or branches) as shown in figure 6.
In these two separated segments, the development platform has divided the
original tree into
elements that are fully valid on the left segment, and what would be invalid
on the right segment.
The segmentation process can then either proceed with just the left branch or
it could alter the
right branch to remove the <voice> element resulting in the two trees
(segments, branches)
shown in figure 7.
Now both trees will be considered valid and therefore can be merged back
together into a single
valid tree as shown in figure 8.
Now that the new valid tree has been constructed, the development platform can
re-assemble it
back into a valid SSML string resulting in:
<speak>
This is text
<prosody volume="+2dB">
Loud text
This is text too
</prosody>
</speak>
This string can be provided to the Google assistant platform without violating
Google's
interpretation of the SSML definition.
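Using the same assumed node shape as the earlier parsing sketch, the effect of this segmentation-and-merge step can be approximated by removing the disallowed element while promoting its children, for example:

def remove_element(node, disallowed):
    # Drop any child whose element name is disallowed, lifting that child's own
    # children up so the surviving parent effects still apply to them.
    new_children = []
    for child in node["children"]:
        child = remove_element(child, disallowed)
        if child["name"] == disallowed:
            new_children.extend(child["children"])
        else:
            new_children.append(child)
    return {**node, "children": new_children}

# Applied to the tree for the example above with disallowed="voice", the <voice>
# node disappears and "This is text too" stays inside <prosody volume="+2dB">.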
The segmenting process also works recursively, so if there are "N" number of
nested elements
that need to be segmented, it can be broken into "N" + 1 trees and
appropriately re-assembled by
shifting the segmented and removed elements' children up to their original
parent elements.
The segmentation process can then be used in tandem with the rule engine to
automatically
generate a valid SSML string from an invalid SSML string by segmenting the
original tree where
the rules are broken.
The segmenting process can also be applied separately to allow for using the
separated trees to
run custom logic. For example, some text-to-speech services support the
<audio> element while
others don't. So when trying to generate audio files from the SSML that has
<audio> elements,
the segmentation engine can segment the trees separately, then generate the
output speech audio
files and keep the audio files separate but in order.
For example, consider the SSML string:
<speak>
<prosody rate="-20%">
this is slow
<audio src="httus://someurl.com/somefile.mp3"/>
This is still slow but comes after the audio
</prosody>
</speak>
The segmenting engine would break it into three different SSML strings:
<speak>
<prosody rate="-20%">
this is slow
</prosody>
</speak>
<audio src="httus://someurl.com/somefile.mp3"/>
<speak>
<prosody rate="-20%">
This is still slow but comes after the audio
</prosody>
</speak>
Using these three different strings, the development platform can process them
individually for
text-to-speech, resulting in three .mp3 files that can be played back to back
as one full
representation of the entire input.
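A sketch of that splitting step, again using the assumed node shape from the earlier sketches, is shown below; rendering each returned segment with a helper like the to_ssml sketch above yields the three strings listed.

def split_at_audio(node):
    # Returns an ordered list of subtrees: each <audio> element stands alone,
    # and surrounding content is re-wrapped in copies of its ancestors.
    if node["name"] == "audio":
        return [node]
    segments, current = [], []
    for child in node["children"]:
        for piece in split_at_audio(child):
            if piece["name"] == "audio":
                if current:
                    segments.append({**node, "children": current})
                    current = []
                segments.append(piece)
            else:
                current.append(piece)
    if current or not segments:
        segments.append({**node, "children": current})
    return segments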
Visual (Graphical) Tool for Representation and Editing of SSML
As shown in figures 9 and 10, in order to make the creation and editing of
SSML strings easy to
do by even a non-technical user without having to understand the elements,
rules, and code
formatting of SSML, the development platform includes a visual (e.g.,
graphical) editor (tool)
that comprises:
A visual (e.g., graphical) representation of the SSML structure using
Icon representation of each element type
Color representation of each element type
Shapes and nesting
A visual tool for adding SSML elements to a string and assigning values to
properties having
pre-configured settings
The visual tool (we sometimes use the term "visual" and the term "graphical"
interchangeably)
enables a user to add SSML effects to the output SSML string using a menu of
supported
options. Selecting each of the options causes the menu to be changed to
customize the
appropriate available properties of that element type.
For example, a user can highlight a word or phrase of text 100 that is part of
a sentence 102 to be
read in response to an end user of an interaction assistant. The user then can
wrap the highlighted
piece of text in, for example, a <prosody rate=".." element by opening an
SSML menu 104 and
selecting the Effect tool 106. The Effect tool, in this case, is one of three
tools (the other two
being "say as" 112 and "voice" 114 that are usable to alter highlighted
portions of the text. The
menu also enables the user to insert a break 108 or audio 110. When the Effect
tool is invoked,
the menu changes to the submenu 116 which presents icons associated with
properties of the
element type just selected. In this example, the icons in the submenu 116
include speed 118,
pitch 120, emphasis 122, and volume 124. If the user then invokes, for
example, the speed icon
118, the menu changes to the submenu 126. The user then can choose one of
several pre-selected
speeds 128 or can specify a custom speed value 130. Having made a selection or
specified a
custom speed value, the user can invoke an "Add" option 132 to cause the text
in the textbox to
be updated to reflect the new SSML element placed in its intended position in
the text.
As shown in figure 10, the visual representation of SSML presented by the
visual tool now
includes an icon 134 that graphically suggests or represents the effect the
SSML element will
add to the highlighted word or phrase of the SSML string. This icon presented
in the visual
representation matches the same icon 136 used for the element in the menu
options. The icon
also contains the word or phrase 138 that will be affected by the effect or
effects. As explained
below, the effects can be nested, and the nesting relationship of effects to the
word or words
affected will be presented in an intuitive manner to the user.
For elements that have a scale value such as rate, pitch, volume, or emphasis,
for example, the
visual tool presents a small vertical value indicator 140 next to the icon to
show where the
current value 142 is on the scale. The user of the SSML visual tool can also
cause the pointer to
hover over the icon or the scale indicator to view a tooltip 144 explaining
the details of the
element including the name, value, and others. The user can then click the
tooltip to open the
SSML menu 145 for that element, where the user can edit the given value 147
and then invoke
the update control 164, or can remove the effect from the given text by
invoking the remove
control 166. As shown in the portion of the example 170, the visual tool
enables the user to read
the text and see the effects that will be applied and how those effects are
nested. In this case,
after the phrase "This speech will be read out" without any effect, the words
"slow and loud"
will be read out slowly and the words "and loud" will also be read out loudly.
After the word
"with" a break is inserted before the phrase "a break in between." As can be
seen from the
example, the user can easily add, edit, and remove one or more effects, and
nest them easily, and
can intuitively understand how the effects will be applied in the nested
fashion to words and
phrases in the sentence.
Along with the interactive icon and scale indicator, the visual representation
of the SSML
includes color coded backgrounds 146, 148 that correspond to given effects
(for example, speed
could always be represented by a pink color). These backgrounds also have
rounded "pill"
shaped ends 150, 152 to help indicate the start and end of a given effect.
These visual elements
(e.g., pill-shaped icons) can also be nested within each other to show how the
SSML elements
themselves are nested within one another. For example, a volume icon 154 may
be nested within
a speed icon 156. When an SSML element is nested as a child within another
SSML element, the
visual representation will add a small padding 158 to the end of the parent's
background "pill" to
show that the parent ends when the child element ends.
The visual tool includes a button called "show raw SSML" 160 that can be
invoked to show the
code version of the SSML string 162 including the markup in-line.
The visual representation can also be edited directly in-line just like a
normal textbox, including
removing SSML elements by backspacing, or deleting the "pill" entity in the
textbox.
Figures 9 and 10 show examples of each of the different stages of use of the
visual tool including
adding a new element to an SSML string having no original elements, nesting
elements within
each other with each of the elements having its own visual representation, and
how the hover and
edit states work with the menu re-opening to allow for making changes.
Other features
The development platform offers a single, abstract representation of an
interaction model that
enables building and managing a wide range of specific interaction models
based on a single
consistent format. Developers can rely on the single generic interaction model
or if necessary can
customize the interaction model within the development platform. The original
or customized
general interaction model that results from the developer's work can then
automatically be
translated to syntax required by assistant platforms such as Amazon Alexa and
Google Assistant
before deployment for use with the different assistant platforms.

Once a specific interaction application has been developed and deployed with
particular assistant
platforms, the same specific interaction application can be easily deployed to
other assistant
platforms based on the information stored within the development platform's
representation of
the general interaction model such as the intents, slots, patterns, and sample
utterances. For
example, a specific interaction application initially deployed to Amazon Alexa
and Google
Assistant, can later be deployed to Samsung Bixby based on the development
platform's
representation of the general interaction model. The platform's general
interaction model can be
translated and mapped to the Samsung Bixby structure to serve as an
interaction model and then
subsequently deployed to the Bixby platform using the specific interaction
application's
developed model.
The development platform's general interaction model leverages generalized,
abstract intents and
open-ended slot types that provide greater flexibility for utterance matching.
This greater
flexibility enables other features including that new content can be added
without requiring an
update to the general interaction model, and therefore without requiring re-
deployment or
recertification. The ability to create interaction applications without coding
enables a broad non-
technical user base to create voice, chat, and other interaction applications.
The development
platform also allows users to manage content without managing business logic,
whereas content,
business logic, and intents are tightly coupled in custom or flow-based tools.
The development platform can provide additional and custom natural language
processing to
supplement the natural language processing done by the assistant platform. One
reason is that the
platform does not require using explicit (non-abstract) intents having data-
type specific (non-
open-ended) slots, which are limited in functionality and provide less
conversational context that
can be processed at the development platform.
As shown in figure 11, in the developer's user interface 402 of the
interaction platform, the
interaction model page as shown provides a general question control 406.
Invoking this control
exposes a list 403 of other controls for individual sample utterance patterns
408. The first such
control 406 enables the user to add a new phrase (a new sample utterance
pattern). The developer
can enter the new phrase 410 in the box, can insert one or more slots by
clicking the control 412
and, when finished, the developer can click the add phrase control 414 to
cause the new phrase to
be added to the particular interaction model being worked on.
Each of the other controls for individual sample utterance patterns shows an
existing sample
utterance pattern 416 and provides a control 418 to enable the developer to
edit the phrase and a
trash control 420 to enable deleting the sample utterance pattern from the
interaction model
being developed. The taskbar 422 on the left side of the page includes
icons 423 representing
voice applications that have been developed or are being developed for an
enterprise. Invoking
one of the icons will switch to the user interface to represent features of
the corresponding voice
application. The capabilities of the development platform are easily
extendable; users can update
an interaction model simply by providing new sample utterances, without any
need to create
additional intents and corresponding feature mappings.
For example, as shown in figure 13, the developer could add a new utterance
pattern 520 for the
phrase "where art thou {Query}".With this additional utterance pattern, the
developer can
configure any number sample utterances to handle questions that follow the
same pattern (in the
manner illustrated earlier); such as "where art thou Romeo" or "where art thou
my friend whom I
met last summer".
Because the development platform does not require an exact match of a spoken
phrase (an actual
end-user utterance) to a particular sample utterance, the platform can handle
thousands of unique
items of content with lower risk of conflicts.
Figure 12 illustrates a page 502 of the developer's user interface of the
interaction platform in
which the developer can design a portion of an interaction model. In the
example shown, the
interaction model being developed is to be used with an interaction
application involving job
openings. Here, the developer has entered three variations 504, 506, 508 of
different sample
utterances for a given intent. At run time, the interaction application would
find a match for this
intent for any of the following end-user requests: "Are there any sales
positions", "Are there any
sales jobs", "Are there any sales positions available today". The development
platform permits
entering a few sample utterances for the abstract intents rather than one
specific utterance for
every potential intent.
The development platform can automate additional utterance pattern suggestions
based on the
contents of a developer's utterance and machine learning based results from
collections of
utterance patterns within a specific industry.
As a developer enters a sample utterance, the platform can recognize a pattern
represented by the
sample utterance based on the contents of the sample utterance. As an example,
if the developer
enters the sample utterance "Are there any sales positions available" (504),
the platform can infer
that this is an utterance of type General Question together with the key
phrase "sales positions
available". Based on this information the platform may suggest adding the
following sample
utterances to cover additional ways an end-user might ask the question about
sales positions:
"Is there a sales position available"
"Do you have any sales positions available"
"What are the sales positions available"
The suggested sample utterances are based on sample utterances within the
platform's standard
interaction model for the General Question type (402).
Automatic utterance pattern suggestion enables other features, including the
following. The
development platform is able to take sample utterance variations from the
development
platform's stored utterances and compare them with the sample utterances of
the open-ended
intents (sample utterance patterns) to determine if the utterance variations
are valid or not (that
is, are explicitly associated with one of the sample utterance patterns). If
the sample utterance
variation does not fit an existing utterance pattern, then it might not be
found during content
search. To prevent this, the development platform can suggest adding a new
utterance pattern to
a customized interaction model based on the utterance variation. This
comparison is done using
the interaction model's utterance pattern, which contains a few words and then
a slot variable, and determining whether the given new utterance variation fits within the utterance
pattern. If it does
not fit the exact pattern, the development platform can determine multiple
options of new
utterance patterns to add to the interaction model's set of utterance
patterns. This is done by
breaking down the entire new sample utterance into the individual words in the
new sample
utterance and then determining the most open-ended utterance pattern to add by
using 1-3 of the
first or last words in the expression that are either verbs or articles. It
then creates the new pattern
with the slot that would represent the rest of the phrase in the utterance
pattern.
For example, if a new sample utterance variation of "I would like to order a
box of cookies" was
added by a developer, but does not correspond to any of the existing sample
utterance patterns,
the development platform might suggest something like adding "I would {Query}"
to the sample
utterance patterns.
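The heuristic can be sketched in a few lines of Python; the small word list and the three-word limit below are illustrative assumptions standing in for the platform's verb-and-article detection.

LEADING_WORDS = {"i", "we", "you", "a", "an", "the", "is", "are", "do", "does",
                 "can", "could", "would", "want", "give", "show", "tell", "create"}

def suggest_pattern(utterance, max_leading_words=3):
    # Keep up to the first few leading verbs/articles/pronouns and replace the
    # rest of the phrase with an open-ended {Query} slot.
    words = utterance.split()
    leading = []
    for word in words[:max_leading_words]:
        if word.lower() not in LEADING_WORDS:
            break
        leading.append(word)
    if not leading:
        leading = words[:1]
    return " ".join(leading) + " {Query}"

print(suggest_pattern("I would like to order a box of cookies"))  # I would {Query}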
The platform can further simplify the creation of utterance patterns based on
Machine Learning
(ML) models trained on utterance patterns within specific industries. The
platform stores
interaction models (including the utterance patterns) and categorizes them
based on industry
vertical. The customized contents of interaction models within a given
vertical are used as
training data to determine suggested sample utterance patterns for new
interaction applications
within the same vertical.
Figure 14, for example, shows three known customers in the healthcare industry
A, B, and C. If
a majority of healthcare customers (say customer A and customer B in this
example) add a
common phrase 540 (sample utterance pattern) to their respective interaction
applications, the
development platform automatically recognizes a correlation between that
sample utterance
pattern and a particular vertical industry (healthcare in this example), and
is able to begin
suggesting this sample utterance pattern for inclusion in interaction
applications being developed
by other customers in the industry.
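At its simplest, the idea can be sketched as a frequency count over stored patterns per vertical; the data layout, threshold, and customer names below are illustrative assumptions rather than the platform's trained ML models.

from collections import Counter

vertical_patterns = {
    "healthcare": {
        "customer_a": {"ER wait times", "How busy is the ER"},
        "customer_b": {"ER wait times", "What is the wait time for the ER"},
        "customer_c": {"Where is the pharmacy"},
    }
}

def suggest_for_vertical(vertical, existing_patterns, threshold=0.5):
    # Suggest any pattern used by more than `threshold` of customers in the
    # vertical that the current application does not already have.
    customers = vertical_patterns[vertical]
    counts = Counter(p for patterns in customers.values() for p in patterns)
    common = {p for p, n in counts.items() if n / len(customers) > threshold}
    return sorted(common - set(existing_patterns))

print(suggest_for_vertical("healthcare", existing_patterns=[]))  # ['ER wait times']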
For example, over time the interaction platform collects (identifies) utterance patterns used
utterance patterns used
within interaction applications in the healthcare industry. ML models
determine that a majority
of the applications use distinct common utterance patterns for questions
related to ER visits:
"How long do I have to wait to get into the ER"
"What is the wait time for the ER"
"How busy is the ER"
If a developer for an enterprise in the healthcare industry creates a new
sample utterance using
one of the phrases above, the development platform will automatically suggest
to the developer
the use of additional utterance patterns based on ML results.
Figure 15 illustrates an example of customers adding the common sample
utterance "ER wait
times" which is then suggested 546 by the development platform to remaining
customers 548.
Figure 16 demonstrates how stored sample utterances 550 suggested by customers
A, B, and C
can be used to suggest similar sample utterances, even to developers of
enterprises (customer D)
who are not specifically known by the development platform to be in
healthcare. "ER wait
times" is aphrase that is likely healthcare related and, when the development
platform determines
that, other industry related phrases that are part of utterance patterns can
be suggested 552 for
use by the developer.
The development platform stores archived snapshots of the interaction model of
each interaction
application. The archived snapshots can be useful for a variety of purposes, for example as a
mechanism for version control and analysis of performance based on utterance
failures.
The development platform also uses a more traditional content form style of
managing content
which does not require a large canvas of intersecting items.
Because the development platform does not require custom coding or
implementation or design
of endpoint business logic, non-technical administrators can create rich conversational
conversational experiences
more easily while focusing on the content instead.
Other implementations are also within the scope of the following claims.

Representative Drawing
A single figure which represents the drawing illustrating the invention.
Administrative Status

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer, as well as the definitions for Patent, Administrative Status, Maintenance Fee and Payment History, should be consulted.


Forecasted Issue Date: Unavailable
(86) PCT Filing Date: 2020-08-13
(87) PCT Publication Date: 2021-02-25
(85) National Entry: 2022-02-18

Abandonment History

There is no abandonment history.

Maintenance Fee

Last Payment of $100.00 was received on 2023-08-04


 Upcoming maintenance fee amounts

Description Date Amount
Next Payment if standard fee 2024-08-13 $125.00
Next Payment if small entity fee 2024-08-13 $50.00

Note: If the full payment has not been received on or before the date indicated, a further fee may be required which may be one of the following

  • the reinstatement fee;
  • the late payment fee; or
  • additional fee to reverse deemed expiry.

Patent fees are adjusted on the 1st of January every year. The amounts above are the current amounts if received by December 31 of the current year.
Please refer to the CIPO Patent Fees web page to see all current fee amounts.

Payment History

Fee Type Anniversary Year Due Date Amount Paid Paid Date
Application Fee 2022-02-18 $407.18 2022-02-18
Maintenance Fee - Application - New Act 2 2022-08-15 $100.00 2022-08-05
Maintenance Fee - Application - New Act 3 2023-08-14 $100.00 2023-08-04
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
VOICIFY, LLC
Past Owners on Record
None
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents


List of published and non-published patent-specific documents on the CPD.

If you have any difficulty accessing content, you can call the Client Service Centre at 1-866-997-1936 or send them an e-mail at CIPO Client Service Centre.


Document Description   Date (yyyy-mm-dd)   Number of pages   Size of Image (KB)
Abstract 2022-02-18 2 79
Claims 2022-02-18 9 379
Drawings 2022-02-18 13 244
Description 2022-02-18 35 1,633
Representative Drawing 2022-02-18 1 30
Patent Cooperation Treaty (PCT) 2022-02-18 2 84
International Search Report 2022-02-18 3 161
National Entry Request 2022-02-18 6 165
Cover Page 2022-04-25 1 50