Patent 3065841 Summary

(12) Patent Application:	(11) CA 3065841
(54) English Title:	SYSTEM AND METHOD FOR ADAPTIVE DATA VISUALIZATION
(54) French Title:	SYSTEME ET PROCEDE DE VISUALISATION ADAPTATIVE DE DONNEES
Status:	Report sent

Bibliographic Data

(51) International Patent Classification (IPC):	G06F 3/0481 (2022.01) G06F 3/04842 (2022.01) G06F 17/00 (2019.01)
(72) Inventors :	WANG, LUYU (Canada) CAO, YANSHUAI (Canada)
(73) Owners :	ROYAL BANK OF CANADA (Canada)
(71) Applicants :	ROYAL BANK OF CANADA (Canada)
(74) Agent:	NORTON ROSE FULBRIGHT CANADA LLP/S.E.N.C.R.L., S.R.L.
(74) Associate agent:
(45) Issued:
(86) PCT Filing Date:	2018-05-08
(87) Open to Public Inspection:	2018-12-20
Examination requested:	2022-09-14
Availability of licence:	N/A
(25) Language of filing:	English

Patent Cooperation Treaty (PCT):	Yes
(86) PCT Filing Number:	PCT/CA2018/050545
(87) International Publication Number:	WO2018/227277
(85) National Entry:	2019-12-02

(30) Application Priority Data:

Application No.	Country/Territory	Date
62/518,386	United States of America	2017-06-12

Abstracts

English Abstract

An interactive data visualization system is provided that utilizes an unsupervised learning process to automatically choose the hyperparameters for generating insights, which are then used for visualizing the data using interactive plots that update dynamically in response to input control commands.

French Abstract

L'invention concerne un système interactif de visualisation de données qui met en oeuvre un processus d'apprentissage non supervisé pour choisir automatiquement les hyperparamètres en vue de générer des aperçus qui sont ensuite utilisés pour visualiser les données à l'aide de représentations graphiques interactives mises à jour de manière dynamique en réponse à des instructions de contrôle des entrées.

Claims

Note: Claims are shown in the official language in which they were submitted.

WHAT IS CLAIMED IS:
Any and all features of novelty or inventive step described, suggested,
referred to, exemplified, or shown herein, including but not limited to
processes, systems, devices, and computer-readable and -executable
programming and/or other instruction sets suitable for use in implementing
such features.
1. A data visualization system for generating one or more visualizations
indicative of
chaining of union or intersect of selections, the data visualization system
comprising:
a processor configured to process machine readable instructions to:
receive user files;
process the user files by applying an automatic hyperparameter selection;
generate interactive plots using the processed user files, the interactive
plots indicative of chaining of union or intersect of selections;
store the interactive plots;
generate an interface with visual elements indicating the interactive plots,
the interface having selectable indicia configured to be responsive to input
to select a data point or subset of data points of the interactive plots;
responsive to the selectable indicia, generate updated interactive plots
based on the selected data point or subset of data points;
store the updated interactive plots;
update the interface with additional visual elements indicating the updated
interactive plots; and
a user interface component configured to display the interface with the visual
elements indicating the interactive plots and the additional visual elements
indicating the updated interactive plots.
2. The system of claim 1 wherein the processor is configured to process the
user files using
a pseudo Bayesian Information criterion for the automatic hyperparameter
selection.
3. The data visualization system of claim 2, wherein the pseudo Bayesian
information
criterion is applied to automatically generate a best perplexity.
4. The data visualization system of claim 3 wherein the pseudo Bayesian
Information
criterion is computed using:
[Equation shown as an image in the original; not reproduced here]
where p is the perplexity, N is a number of data points of the user files, and
kl_div(p) is a
Kullback-Leibler divergence of t-SNE with perplexity p on the user files.
5. The system of claim 3 wherein the processor is configured to implement
machine learning
to compute t-SNE with different perplexities to select the best perplexity.
6. The system of claim 2 wherein the selectable indicia comprises a slider
to select a value
for a perplexity for the pseudo Bayesian Information criterion to update the
interactive
plots.
7. The system of claim 1 wherein the processor is configured to implement
an unsupervised
learning process for the automatic hyperparameter selection.
8. The system of claim 1 wherein the processor is configured to process the
user files by
applying the automatic hyperparameter selection to reduce the dimensionality
of the user
files for generation of the interactive plot.
9. The system of claim 1 wherein the user files comprise high dimensional
data and the
interactive plot comprises two dimensional data or three dimensional data, the
processor
being configured to process the user files by applying the automatic
hyperparameter
selection to reduce the dimensionality of the user files from the high
dimensional data to
the two dimensional data or the three dimensional data.
10. The system of claim 1 wherein the interactive plots represent scatter
plots linked to
histograms of an original dimension of the user files to show a comparison
between
distributions of the selected data point or the subset of data points.
11. The system of claim 1 wherein the data point represents an outlier data
point or the
subset of data points represents a cluster.
12. The system of claim 1 wherein the interface has the selectable indicia
configured to be
responsive to input to trigger an operation for chaining of union or intersect
selections of
the selected data point or the subset of data points.
13. The system of claim 1 wherein the processor is configured to process
the user files to
reduce the dimensionality of the user files for generation of the interactive
plots using
dimensionality reduction processes PCA, ICA and t-SNE, the interactive plots
comprising
reduction results from the dimensionality reduction processes.
14. The system of claim 10 wherein the interactive plots comprise a first
scatter plot for the
dimensionality reduction process PCA, a second scatter plot for the
dimensionality
reduction process ICA, and a third scatter plot for the dimensionality reduction
process t-SNE, and a plurality of histograms showing distributions for the
dimensionality reduction
processes.
15. The system of claim 1 wherein the processor is configured to store
received input in a
data storage as past selections for use in generating a union or intersect.
16. The system of claim 1 wherein the selected data point or the subset of
data points is from
a first interactive plot which triggers generation of an automatic update of
visual elements
for other interactive plots at the interface.
17. The system of claim 1 wherein the selectable indicia are logical anchor
points of the visual
elements that are indicative of an interactive ability to control
visualization and the
interface.
18. The system of claim 1 wherein the processor is configured to preprocess
the user files to
correct missing values, set appropriate types, and compute descriptive data.
19. A data visualization process for generating one or more visualizations
indicative of
chaining of union or intersect of selections, the process comprising:
at a processor,
receiving user files;
processing the user files by using a pseudo Bayesian Information
criterion for automatic hyperparameter selection, wherein the pseudo
Bayesian information criterion is applied to automatically generate an
optimal perplexity;
generating interactive plots using the processed user files, the interactive
plots indicative of chaining of union or intersect of selections;
storing the interactive plots;
generating an interface with visual elements indicating the interactive plots,
the interface having selectable indicia configured to be responsive to input
to select a data point or subset of data points of the interactive plots;
responsive to the selectable indicia, generating updated interactive plots
based on the selected data point or subset of data points;
storing the updated interactive plots;
updating the interface with additional visual elements indicating the
updated interactive plots; and
at a user interface component,
displaying the interface with the visual elements indicating the interactive
plots and dynamically updating the interface with the additional visual elements
indicating the updated interactive plots.
20. A computer readable medium storing machine executable instructions to
configure a
processor for generating one or more visualizations indicative of chaining of
union or
intersect of selections by:
receiving user files;
processing the user files by using a pseudo Bayesian Information
criterion for automatic hyperparameter selection, wherein the pseudo Bayesian
information criterion is applied to automatically generate an optimal
perplexity;
generating interactive plots using the processed user files, the interactive
plots
indicative of chaining of union or intersect of selections;
storing the interactive plots;
generating an interface with visual elements indicating the interactive plots,
the
interface having selectable indicia configured to be responsive to input to
select a
data point or subset of data points of the interactive plots;
responsive to the selectable indicia, generating updated interactive plots
based on
the selected data point or subset of data points;
storing the updated interactive plots;
updating the interface with additional visual elements indicating the updated
interactive plots; and
controlling the display of the interface, at a display device, with the visual
elements
indicating the interactive plots and dynamically updating the interface with the
additional visual elements indicating the updated interactive plots.

Description

Note: Descriptions are shown in the official language in which they were submitted.

CA 03065841 2019-12-02
WO 2018/227277
PCT/CA2018/050545
SYSTEM AND METHOD FOR ADAPTIVE DATA VISUALIZATION
FIELD
[0001] The present disclosure generally relates to the field of data
visualization, and more
particularly, to data visualization having regard to hyperparameter selection.
INTRODUCTION
[0002] Data visualization is an informative step in the process of data
analytics and business
intelligence that presents the high-dimensional data in a human-understandable
way.
[0003] Data visualization can lead to the discovery of novel hidden
patterns within the data that
may not be understandable or discoverable in the visualization of underlying
raw data.
[0004] Data visualization can automatically generate visual elements for an
interface by
identifying the patterns in raw data sets.
SUMMARY
[0005] In accordance with an aspect, there is provided a data
visualization system for
generating one or more visualizations indicative of chaining of union or
intersect of selections.
The data visualization system involves a processor configured to process
machine readable
instructions to: receive user files; process the user files by applying an
automatic hyperparameter
selection; generate interactive plots using the processed user files, the
interactive plots indicative
of chaining of union or intersect of selections; store the interactive plots;
generate an interface
with visual elements indicating the interactive plots, the interface having
selectable indicia
configured to be responsive to input to select a data point or subset of data
points of the
interactive plots; responsive to the selectable indicia, generate updated
interactive plots based on
the selected data point or subset of data points; store the updated
interactive plots; update the
interface with additional visual elements indicating the updated interactive
plots; and a user
interface component configured to display the interface with the visual
elements indicating the
interactive plots and the additional visual elements indicating the updated
interactive plots. In
some embodiments, the processor is configured to process the user files using
a pseudo
Bayesian Information criterion for the automatic hyperparameter selection.
[0006] In some embodiments, the pseudo Bayesian information criterion is
applied to
automatically generate a best perplexity.

[0007] In some embodiments, the pseudo Bayesian Information criterion is
computed using:
[Equation shown as an image in the original; not reproduced here]
where p is the perplexity, N is a number of data points of the user files, and
kl_div(p) is a
Kullback-Leibler divergence of t-SNE with perplexity p on the user files.
[0008] In some embodiments, the processor is configured to implement
machine learning to
compute t-SNE with different perplexities to select the best perplexity.
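The selection step described in paragraphs [0006] to [0008] can be illustrated with a minimal sketch. The formula for the criterion is not reproduced in this text, so the sketch assumes a BIC-like form, 2 * kl_div(p) + log(N) * p / N; this form, the function names pseudo_bic and select_perplexity, and the sample numbers are illustrative assumptions and are not taken from the text. The Kullback-Leibler divergences are assumed to come from separate t-SNE runs, one per candidate perplexity (for example, scikit-learn's TSNE exposes a kl_divergence_ attribute after fitting).

```python
import math


def pseudo_bic(kl_div, perplexity, n_points):
    """Score one t-SNE run. ASSUMED form: 2*KL + log(N) * p / N.

    kl_div     -- final KL divergence of a t-SNE run at this perplexity
    perplexity -- the perplexity p used for that run
    n_points   -- N, the number of data points in the user files
    """
    return 2.0 * kl_div + math.log(n_points) * perplexity / n_points


def select_perplexity(kl_by_perplexity, n_points):
    """Return (best_perplexity, scores), where best has the lowest score."""
    scores = {
        p: pseudo_bic(kl, p, n_points)
        for p, kl in kl_by_perplexity.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical KL divergences from three t-SNE runs on N = 1000 points.
best, scores = select_perplexity({5: 1.20, 30: 0.80, 50: 0.79}, n_points=1000)
print(best)
```

Under the assumed form, the first term rewards a better fit (lower divergence) and the second term penalizes larger perplexities, so a marginal drop in divergence at a much higher perplexity does not win automatically.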
[0009] In some embodiments, the selectable indicia has a slider to select
a value for a
perplexity for the pseudo Bayesian Information criterion to update the
interactive plots.
[0010] In some embodiments, the processor is configured to implement an
unsupervised
learning process for the automatic hyperparameter selection.
[0011] In some embodiments, the processor is configured to process the user
files by applying
the automatic hyperparameter selection to reduce the dimensionality of the
user files for
generation of the interactive plot.
[0012] In some embodiments, the user files are high dimensional data and
the interactive plots
are two dimensional data or three dimensional data, the processor being
configured to
process the user files by applying the automatic hyperparameter selection to
reduce the
dimensionality of the user files from the high dimensional data to the two
dimensional data or the
three dimensional data.
[0013] In some embodiments, the interactive plots represent scatter plots
linked to histograms
of an original dimension of the user files to show a comparison between
distributions of the selected
data point or the subset of data points.
[0014] In some embodiments, the data point represents an outlier data
point or the subset of
data points represents a cluster.
[0015] In some embodiments, the interface has the selectable indicia
configured to be
responsive to input to trigger an operation for chaining of union or intersect
selections of the
selected data point or the subset of data points.
[0016] In some embodiments, the processor is configured to process the
user files to reduce
the dimensionality of the user files for generation of the interactive plots
using dimensionality
reduction processes PCA, ICA and t-SNE, the interactive plots comprising
reduction results from
the dimensionality reduction processes.
[0017] In some embodiments, the interactive plots have a first scatter
plot for the
dimensionality reduction process PCA, a second scatter plot for the
dimensionality reduction
process ICA, and a third scatter plot for the dimensionality reduction process
t-SNE, and a plurality
of histograms showing distributions for the dimensionality reduction
processes.
[0018] In some embodiments, the processor is configured to store received
input in a data
storage as past selections for use in generating a union or intersect.
[0019] In some embodiments, the selected data point or the subset of data
points is from a first
interactive plot which triggers generation of an automatic update of visual
elements for other
interactive plots at the interface.
[0020] In some embodiments, the selectable indicia are logical anchor
points of the visual
elements that are indicative of an interactive ability to control
visualization and the interface.
[0021] In some embodiments, the processor is configured to preprocess the
user files to
correct missing values, set appropriate types, and compute descriptive data.
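As an illustration of the preprocessing named in paragraph [0021], the following minimal sketch parses string cells into numbers, fills missing cells, and computes simple descriptive data. The function name preprocess, the median fill rule, and the choice of summary fields are assumptions made for this example; the text does not specify how missing values are corrected.

```python
import math
import statistics


def preprocess(rows, numeric_columns):
    """Clean rows (list of dicts, e.g. from csv.DictReader).

    For each numeric column: parse cells to float, replace missing or
    unparsable cells with the column median (an ASSUMED fill rule), and
    record simple descriptive data.
    """
    cleaned = [dict(row) for row in rows]  # leave the caller's data untouched
    summary = {}
    for col in numeric_columns:
        parsed = []
        for row in cleaned:
            try:
                value = float(row.get(col))
                parsed.append(None if math.isnan(value) else value)
            except (TypeError, ValueError):
                parsed.append(None)  # empty, absent or non-numeric cell
        observed = [v for v in parsed if v is not None]
        if not observed:
            continue  # nothing to learn from an entirely empty column
        fill = statistics.median(observed)
        values = [fill if v is None else v for v in parsed]
        for row, value in zip(cleaned, values):
            row[col] = value
        summary[col] = {
            "missing": len(parsed) - len(observed),
            "mean": statistics.fmean(values),
            "min": min(values),
            "max": max(values),
        }
    return cleaned, summary


rows = [{"x": "1.0"}, {"x": ""}, {"x": "3.0"}]
cleaned, summary = preprocess(rows, ["x"])
print(cleaned, summary)
```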
[0022] In accordance with an aspect, there is provided a data
visualization process for
generating one or more visualizations indicative of chaining of union or
intersect of selections.
The process involves: at a processor, receiving user files; processing the
user files by
using a pseudo Bayesian Information criterion for automatic hyperparameter
selection, wherein
the pseudo Bayesian information criterion is applied to automatically generate
an optimal
perplexity; generating interactive plots using the processed user files, the
interactive plots
indicative of chaining of union or intersect of selections; storing the
interactive plots; generating
an interface with visual elements indicating the interactive plots, the
interface having selectable
indicia configured to be responsive to input to select a data point or subset
of data points of the
interactive plots; responsive to the selectable indicia, generating updated
interactive plots based
on the selected data point or subset of data points; storing the updated
interactive plots; updating
the interface with additional visual elements indicating the updated
interactive plots; and at a
user interface component, displaying the interface with the visual elements
indicating the
interactive plots and dynamically updating the interface with the additional visual
elements indicating
the updated interactive plots.
[0023] In accordance with an aspect, there is provided a computer
readable medium storing
machine executable instructions to configure a processor for generating one or
more
visualizations indicative of chaining of union or intersect of selections by:
receiving user files;
processing the user files by using a pseudo Bayesian Information
criterion for automatic
hyperparameter selection, wherein the pseudo Bayesian information criterion is
applied to
automatically generate an optimal perplexity; generating interactive plots
using the processed
user files, the interactive plots indicative of chaining of union or intersect
of selections; storing the
interactive plots; generating an interface with visual elements indicating the
interactive plots, the
interface having selectable indicia configured to be responsive to input to
select a data point or
subset of data points of the interactive plots; responsive to the selectable
indicia, generating
updated interactive plots based on the selected data point or subset of data
points; storing the
updated interactive plots; updating the interface with additional visual
elements indicating the
updated interactive plots; and controlling the display of the interface, at a
display device, with the
visual elements indicating the interactive plots and dynamically updating the
interface with the
additional visual elements indicating the updated interactive plots.
[0024] In accordance with an aspect, there is provided a data
visualization system for
generating one or more visualizations indicative of chaining of
union/intersect of selections, the
data visualization system comprising: a Flask server configured for receiving
user files; and a
visualization server configured to process the user files to generate
interactive plots by applying
pseudo Bayesian Information criterion for automatic hyperparameter selection;
and a graphical
user interface component configured to generate the one or more visualizations
based on the
interactive plots received from the Bokeh server.
[0025] In some embodiments, the pseudo Bayesian information criterion is
applied to generate
the best perplexity automatically without any human prior.
[0026] In various further aspects, embodiments provide corresponding
systems and devices,
and logic structures such as machine-executable coded instruction sets for
implementing
systems, devices, and methods described herein.
[0027] In this respect, before explaining at least one embodiment in
detail, it is to be
understood that the embodiments are not limited in application to the details
of construction and
to the arrangements of the components set forth in the following description
or illustrated in the
drawings.
[0028] Also, it is to be understood that the phraseology and terminology
employed herein are
for the purpose of description and should not be regarded as limiting.
[0029] Many further features and combinations thereof concerning
embodiments described
herein will appear to those skilled in the art following a reading of the
instant disclosure.
DESCRIPTION OF THE FIGURES
[0030] In the figures, embodiments are illustrated by way of example. It
is to be expressly
understood that the description and figures are only for the purpose of
illustration and as an aid to
understanding.
[0031] Embodiments will now be described, by way of example only, with
reference to the
attached figures, wherein in the figures:
[0032] FIG. 1 is an example system architecture diagram, according to some
embodiments.
[0033] FIG. 2 is an illustration of a process flow including a human-in-the-
loop model,
according to some embodiments.
[0034] FIG. 3A and FIG. 3B are plots of KL divergence according to some
embodiments.
[0035] FIG. 4 is an example set of graphical user interfaces that may be
generated and shown
to a user, according to some embodiments.
[0036] FIG. 5 is a plot of preference score against perplexity, for example
Gaussian Blobs
datasets, according to some embodiments.
[0037] FIG. 6 is a plot of preference score against perplexity, for
example Gaussian Blobs
datasets, according to some embodiments, with a star showing the perplexity
obtained from
Pseudo BIC.
[0038] FIG. 7 is an example system architecture diagram according to some
embodiments.
[0039] FIG. 8 is a diagram of an example computing device according to
some embodiments.
DETAILED DESCRIPTION
[0040] The following discussion provides many example embodiments of the
inventive subject
matter. Although each embodiment represents a single combination of inventive
elements, the
inventive subject matter is considered to include all possible combinations of
the disclosed
elements. Thus if one embodiment comprises elements A, B, and C, and a second
embodiment
comprises elements B and D, then the inventive subject matter is also
considered to include other
remaining combinations of A, B, C, or D, even if not explicitly disclosed.
[0041] The embodiments of the devices, systems and methods described
herein may be
implemented in a combination of both hardware and software. These embodiments
may be
implemented on programmable computers, each computer including at least one
processor, a
data storage system (including volatile memory or non-volatile memory or other
data storage
elements or a combination thereof), and at least one communication interface.
[0042] Program code is applied to input data to perform the functions
described herein and to
generate output information. The output information is applied to one or more
output devices. In
some embodiments, the communication interface may be a network communication
interface. In
embodiments in which elements may be combined, the communication interface may
be a
software communication interface, such as those for inter-process
communication. In still other
embodiments, there may be a combination of communication interfaces
implemented as
hardware, software, and combination thereof.
[0043] Systems and methods are described in some embodiments to provide
useful tools
adapted to visualize and interact with characteristics and relationships of
large data sets that
would be impractical or impossible to see through normal plotting. The
visualizations and
interactions are designed to promote the revealing of previously unknowable
interrelationships
between data sets, which may emerge under different hyperparameter selection
conditions.
[0044] In various embodiments, innovative methods and processes for automatic
hyperparameter selection (e.g. without human input) are described, along with
corresponding hyperparameter selection methodologies and specially configured
systems.
[0045] Several data visualization software systems are available, developed
by Tableau™, Palantir™, Airbnb™, etc. These systems are usually in the form of
web services, and they are interactive in the sense that the user can easily
change the view of the data presented from the front-end in real time.
Nevertheless, even though they tend to be easy-to-use and versatile, they only
provide basic statistics about the data. They lack the capability to leverage
the power of machine learning processes.
[0046] Unsupervised machine learning processes, including dimensionality
reduction methods,
can be used for visualizing high-dimensional data while preserving the
intrinsic structures and
patterns. Nevertheless, nonlinear dimensionality reduction algorithms, namely,
t-Distributed
Stochastic Neighbour Embedding (t-SNE), can depend on many hyper-parameters to
obtain
results whose quality is hard to quantify, and therefore manual inspection may
be required. This
usually results in a very time-consuming process. Embodiments described herein
enable
automatic hyper-parameter selection for data processing.
[0047] An interactive data visualization system implementing unsupervised
learning processes
which automatically choose the hyperparameters for the user is provided
according to some
embodiments. The system is adapted for generating insights, when visualizing
the data by
interface generation, to provide the most informative results while providing
interfaces and
processes that have an ease of use and may be entertaining (e.g., fun) for a
user. To address the
problem of visualization hyperparameter selection, an innovative approach to
automatically
choose the best hyperparameter without user involvement is provided, in some
embodiments. A
system is directed to embed an efficient interactive data visualization
procedure into the practical
day-to-day workflow of data scientists and quantitative analysts across
different departments of
an organization.
[0048] Traditionally, data analytics are done with the analyst's experience
and prior knowledge.
Given the data, the analyst usually comes up with the hypothesis first, and
then the analyst writes
programs or SQL scripts accordingly to query the database, to validate their
hypothesis. This
paradigm is limited by human imagination. With the emerging field of data
science, the
importance of data visualization is growing as it makes it possible to better
understand the raw
data with improved and automatic generation of visual elements for an
interface.
[0049] Data visualization allows for a joint effort of machine learning and
human intuition to find valuable and useful patterns within the data, and then
generate hypotheses from the patterns. Data visualization systems can allow the
user to easily select which aspects and which portion of the data to view, and
can update the interface dynamically in response to these input control
commands. Embodiments can use unsupervised machine learning processes, in
particular dimensionality reduction processes, embedded within the
visualization tools.
[0050] Dimensionality reduction algorithms project the originally very
high-dimensional data into two or three dimensions, which can be perceived
visually by humans. This projection can come at the price of losing information
from the original space, especially for linear dimensionality reduction
algorithms including principal component analysis (PCA) and independent
component analysis (ICA).
[0051] The t-distributed stochastic neighbour embedding (t-SNE) method is a
nonlinear approach that preserves the local structure in the high-dimensional
space. The approach adapts to the underlying data, performing different
transformations on different regions. Thus, the approach can faithfully keep
the interesting structures and patterns visible in the low-dimensional
embeddings. This approach can have a hyperparameter identified as "perplexity",
which describes (loosely) the trade-off between how much local and how much
global information to keep during the projection. The perplexity value has a
complex effect on the resulting pictures.
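To make the perplexity hyperparameter more concrete, the following minimal sketch computes the perplexity of a discrete probability distribution as 2 raised to its Shannon entropy in bits, which is the standard definition used for the neighbour distributions in t-SNE. The function name perplexity and the sample distributions are illustrative and do not come from the text.

```python
import math


def perplexity(probabilities):
    """Return 2**H(P), where H(P) is the Shannon entropy of P in bits."""
    entropy_bits = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy_bits


# A point that weights 4 neighbours equally has perplexity 4; concentrating
# the weight on one neighbour lowers the effective neighbour count.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(perplexity([0.7, 0.1, 0.1, 0.1]))
```

Read this way, a perplexity setting acts as a target for the effective number of neighbours each point attends to: small values emphasize local structure and large values bring in more global structure.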
[0052] For example, the performance of t-SNE can be fairly robust to
changes in the perplexity,
and typical values are between 5 and 50 in some examples. But this is more
nuanced than
described. The quality of the simulation might be hard to quantify and might
only be told when a
user sees the result. Getting the most from t-SNE may mean manually analyzing
multiple plots
with different perplexities. One usually chooses factors by experience and
with luck after a few
trials it will show a good result. Accordingly, the hyperparameter selection
process is typically
very time- and mind-consuming. Embodiments described herein provide an
improved
hyperparameter selection process.
[0053] Therefore, an easy-to-use interactive data visualization system
with dimensionality
reduction capability can be useful in various applications by data analysts
and scientists.
[0054] The data visualization system allows data analysts and scientists to
perform better in pattern discovery and hypothesis generation to create
business value from the data. In an embodiment, the system reduces the original
high-dimensional data into two or three dimensions, and
presents the results in scatter plots to the users; users can then drill down
to inspect and find
more information about a specific point (e.g., an outlier) or a subset (e.g.,
a cluster). The data
visualization system, in an embodiment, is configured to provide all results
on the scatter plots
with one click, as well as generate real-time guidance for non-experts in the
form of help
messages. The plots can, for example, be linked to the histograms of the
original dimensions
showing the comparisons between the empirical distributions of selected data
points versus all.
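The comparison between the distribution of the selected points and that of all points can be sketched as follows. This is a minimal illustration that bins one original column with shared bin edges and returns normalized bin fractions for both groups; the function names, the equal-width binning, and the default bin count are assumptions for this example, not details from the text.

```python
def bin_fractions(values, edges):
    """Fraction of values per bin; bins are half-open, the last one closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = edges[i] <= v < edges[i + 1]
            if in_bin or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def compare_distributions(column, selected_rows, n_bins=4):
    """Histogram a column for all rows and for a selected subset of row indices.

    Both histograms share the same bin edges, so they can be drawn
    on one axis and compared directly.
    """
    low, high = min(column), max(column)
    if high == low:
        high = low + 1.0  # avoid zero-width bins for a constant column
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins)] + [high]
    selected = [column[i] for i in selected_rows]
    return bin_fractions(column, edges), bin_fractions(selected, edges)


column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
all_hist, selected_hist = compare_distributions(column, selected_rows=[6, 7, 8])
print(all_hist)
print(selected_hist)
```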
[0055] A user may, through one or more generated interactive graphical
elements, then
interact with the results by looking at different aspects and subsets of the
original data using
complex operations including chaining of union/intersect selections with
irregular boxes on
different scatter plots, which is difficult and/or impossible using
traditional tools such as SQL
queries. Meanwhile, to reduce the burden associated with tuning
hyperparameters, the data
visualization system is configured to automatically select one or more optimal
parameters (e.g.,
the best parameter) for the hyperparameter selection. Accordingly, the system
allows the user to
discover the hidden patterns within the data.
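The chaining of union and intersect selections described in paragraph [0055] can be sketched with plain sets of row indices. In this minimal illustration, each lasso or box selection on any scatter plot is assumed to arrive as a collection of row indices; the class name SelectionHistory, its method names, and the mode strings are hypothetical and are not taken from the text.

```python
class SelectionHistory:
    """Keep the current selection plus past selections for chained set operations."""

    def __init__(self):
        self.current = set()
        self.past = []  # earlier selections, oldest first

    def select(self, rows, mode="replace"):
        """Combine a new selection with the current one.

        mode is "replace", "union" or "intersect". Returns self so that
        calls can be chained.
        """
        rows = set(rows)
        if mode == "replace":
            combined = rows
        elif mode == "union":
            combined = self.current | rows
        elif mode == "intersect":
            combined = self.current & rows
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        self.past.append(self.current)
        self.current = combined
        return self


history = SelectionHistory()
history.select({1, 2, 3}).select({3, 4}, "union").select({2, 3, 4, 9}, "intersect")
print(sorted(history.current))
```

Because every plot refers to rows by the same indices, a selection made on one scatter plot can be combined with a selection made on another, which is awkward to express as a single query over the original columns.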
[0056] In some embodiments, a web-based interactive data visualization
software system is
provided that is configured to facilitate the efficient discovery of the
hidden patterns within the
data, providing ease of use (e.g., some embodiments are adapted for loading
and visualizing the
data with one click). Three dimensionality reduction methods, namely
PCA, ICA, and t-SNE, may be configured for automatic operation, and the reduction results will
be shown on
scatter plots, whereas the histograms of the distribution of each original
column from the input
data are also provided. All plots will be interactive and linked, meaning one
can select a subset of
data from either plot, and the selected data will also be highlighted on all
other plots. Help
messages may be generated in real or near real time (e.g., on the fly) to
guide users in using the
system. Interfaces and processing units may be configured to support union and
intersect of
multiple shots of selections. The selected subset can be output and loaded
again for finer-grained analysis. The system may then select the best
hyperparameter for t-SNE
automatically.
[0057] FIG. 1 is an example system architecture diagram 100, according to
some
embodiments.
[0058] In a practical example implementation, to make possible a web-based interactive
data visualization system that leverages the power of machine learning algorithms, the
system may be built on Python in some examples. Other programming languages
are possible. A
data file 102 is received which may be referred to as a user file.
[0059] The backend has a Flask server 104 configured for managing files,
and a Bokeh server
108 configured for computations, plots generation, help message generation,
and machine
learning. The Flask server is an example framework (i.e., a Python-based micro-
framework), and
other types of frameworks can be utilized for receiving and/or maintaining
data sets (e.g.,
including relational mappers and/or other extensions). The Bokeh server is an
example of a
visualization server, and other types of visualization mechanisms can be
utilized instead of a
Bokeh server. These servers can be implemented using one or more computing
systems that
include processors, computer-readable memory (e.g., random access memory, read
only
memory), storage media, among others, and the files can be obtained through
various
communication interfaces, such as interfaces that communicate through the
Internet, intranets,
point to point communications, among others.
[0060] A front end 106 (e.g., an interface application to render an interface on a display device) is
provided that may include one or more graphical user interface generation mechanisms that
receive as inputs requests for, and data sets representing, visualizations, and generate
corresponding interactive and interface-able user interface elements. For
example, in visualizing
data, it may be desirable that data is not only shown, but can be modified
using selectable indicia
at the interface such that different views are possible and dynamically
generated in response to
activation of the selectable indicia. Accordingly, the front end 106 may be
configured to establish
logical anchor points (e.g. selectable indicia) where visual elements may be
deposited that are
indicative of an ability for a user to be able to interact with the
visualization itself, modifying how
the visualization effects are provided, among others. As an example, the front
end 106 may
instantiate / cause the rendering of handles or widgets / widget bars for
modifying
hyperparameter selection, filtering information, augmenting information shown,
causing rotations,
inversions, changes in perspective, toggling features (e.g., wireframe view),
among others.
Further, additional dimensions (e.g., > 4) may be processed to be renderable
in 2 or 3
dimensions (e.g., by selectively removing dimensions) and provided.
Visualizations need not be
restricted to traditional coordinate units, and various aspects may be
assigned to different types
of coordinate units so that different visualizations are possible (for
example, 3-D space can be
represented in any type of shape, not just Cartesian coordinates, and it may
be useful to visualize
information in the form of cylindrical coordinates, spherical coordinates, or
any coordinate system
that is possible). Non-Euclidean and manifold spaces are possible, and these
different
visualizations may be available for rendering to the user.
[0061] Front end 106 may be configured to assign different variables (or
newly generated
variables formed of a composite of other variables) to different types of
interface elements (e.g.
selectable indicia), and for example, voxels may be generated that are
representative of different
types of information, the voxels for rendering on a user interface.
[0062] In the Flask server 104, Flask Admin is used to manage user data
files 102. The Bokeh
server 108 may be configured to utilize a framework, such as Pandas, to
process the input data
using machine learning processes (e.g., Scikit-Learn) and data models. The
Bokeh server 108
may generate and push interactive plots and help messages onto the front-end. A method for
union and intersect of data selections is applied, as well as a pseudo Bayesian Information
Criterion (pBIC) for automatic hyperparameter selection for t-SNE. To validate the
pBIC method, Applicants designed and performed user studies to learn the users' preferences on
the hyperparameter. The results indicate that the pBIC method works well in operation.
[0063] The system architecture is shown in FIG. 1. On the backend, there
may be two servers
or processors running simultaneously, and the servers may be remote from one
another. In other
embodiments, there may be only one server which handles various operations.
The Flask server
104 can be configured to manage the home and the tutorial page on the front-end. The Flask
server 104 also has an input file management module, which uses Flask Admin,
for example, to
manage data files that the users have uploaded to the server through [1].
[0064] The Bokeh server 108 is configured to handle computations in this
system 100, and is
configured to receive the data file from the file manager on the Flask server
104 that has been
passed to the Bokeh server 108. The Bokeh server 108, for example, can apply a
Pandas
DataFrame (e.g., a Python DataFrame) to preprocess the data. The preprocessing
aids in
rectifying issues with missing values, setting appropriate types (including
numeric, categorical,
and datetime) to each column, and computing certain descriptive information
about the input data
(number of rows and columns, etc.). The processed data can then be cast into a
Bokeh
ColumnDataSource data structure, which can be a wrapper on the DataFrame, the
server hosting
the data to be plotted on different plots on the front-end 106 (which may
further allow for
manipulation by the provisioning and rendering of interactive interface
elements). Next, data is transmitted to a machine learning unit. The system uses three
dimensionality reduction processes, namely, PCA, ICA, and t-SNE. PCA and ICA are fast to
compute, and once they are completed, the reduction results are recorded into the corresponding
ColumnDataSource.
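The preprocessing steps described above (assigning column types, handling missing values, computing descriptive information) can be sketched in plain Python. This is only a minimal illustration using the standard library rather than Pandas; the names `infer_column_type` and `preprocess`, the set of missing-value markers, and the fill strategy (column mean for numeric data, a placeholder label for categorical data) are assumptions made for the example.

```python
from datetime import datetime
from statistics import mean

MISSING = {"", "NA", "NaN", None}  # assumed markers for missing values


def infer_column_type(values):
    """Classify a column as 'numeric', 'datetime', or 'categorical'."""
    present = [v for v in values if v not in MISSING]
    try:
        for v in present:
            float(v)
        return "numeric"
    except (TypeError, ValueError):
        pass
    try:
        for v in present:
            datetime.fromisoformat(v)
        return "datetime"
    except (TypeError, ValueError):
        return "categorical"


def preprocess(columns):
    """Return (cleaned columns, column types, descriptive info)."""
    cleaned, types = {}, {}
    for name, values in columns.items():
        kind = infer_column_type(values)
        types[name] = kind
        if kind == "numeric":
            nums = [float(v) for v in values if v not in MISSING]
            fill = mean(nums) if nums else 0.0  # assumed strategy: fill with the mean
            cleaned[name] = [fill if v in MISSING else float(v) for v in values]
        elif kind == "datetime":
            cleaned[name] = [
                None if v in MISSING else datetime.fromisoformat(v) for v in values
            ]
        else:
            cleaned[name] = ["missing" if v in MISSING else v for v in values]
    n_rows = len(next(iter(cleaned.values()))) if cleaned else 0
    info = {"rows": n_rows, "columns": len(cleaned)}
    return cleaned, types, info


raw = {
    "amount": ["10.5", "", "3.5"],
    "opened": ["2018-05-08", "2018-12-20", "NA"],
    "segment": ["retail", "NA", "commercial"],
}
cleaned, types, info = preprocess(raw)
```

In this sketch only the numeric columns would be passed on to the dimensionality reduction processes; the datetime and categorical columns are kept for display.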
[0065] The generation of plots is done by the Bokeh server 108. The Bokeh
server 108
generates, for example, three scatter plots for all three dimensionality
reduction processes and a
number of histograms showing the empirical distribution of each original
column, and then pushes
the processed data to the front-end 106. The ColumnDataSource can, in some
embodiments, be
a source of the data for all plots. The plots may be linked to one another
(e.g., modifications on
one lead to modifications on another).
[0066] When there are points selected from either of the plots or
histograms, the
ColumnDataSource will capture the selection, and the selected points can be
highlighted on all of
the plots as shown on front end 106. Meanwhile, categorical and datetime columns may, for
example, not be taken into account by those algorithms and may be visualized
differently (e.g., using
different colours or other visual distinguishing elements). Various filters
and other types of visual
effects may be overlaid or applied such that visualizations are modified.
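The linking behaviour described above, where one shared data source captures a selection and every plot highlights it, can be illustrated with a small observer-style sketch. The classes `SharedSource` and `Plot` are hypothetical stand-ins written for this example; they are not the Bokeh ColumnDataSource API.

```python
class SharedSource:
    """Hypothetical stand-in for a data source shared by several plots."""

    def __init__(self, n_points):
        self.n_points = n_points
        self.selected = set()
        self._listeners = []

    def on_change(self, callback):
        """Register a callback to run whenever the selection changes."""
        self._listeners.append(callback)

    def select(self, indices):
        """Record the selected row indices and notify every linked plot."""
        self.selected = {i for i in indices if 0 <= i < self.n_points}
        for callback in self._listeners:
            callback(self.selected)


class Plot:
    """A plot that highlights whatever the shared source reports as selected."""

    def __init__(self, name, source):
        self.name = name
        self.highlighted = set()
        source.on_change(self._highlight)

    def _highlight(self, selected):
        self.highlighted = set(selected)


source = SharedSource(n_points=10)
plots = [Plot(name, source) for name in ("pca", "ica", "tsne", "histogram")]
source.select([1, 4, 7])  # e.g., a box selection made on the PCA scatter plot
```

Because every plot reads from the same source, a selection made on one plot needs no per-plot bookkeeping to appear on the others.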
[0067] The front-end is in HTML with CSS and Bootstrap™. Bokeh plots can
be embedded in
the HTML files by utilizing BokehJS. On the Bokeh server 108, once the plots
are generated, the
plots are converted into BokehJS and sent to the front-end 106. The widgets on the front-end
106, including tabs, buttons, dropdowns, etc., can also be generated with the Bokeh plots
generation module. Once the user uses them from the front-end, the module may be configured
to trigger callback functions that make corresponding changes happen (e.g., a dropdown to
choose which categorical feature to visualize using colors). For ease of use, some
embodiments also include a help message engine on the backend, which generates
corresponding help messages on the algorithm and the user operations in real-time.
[0068] Complex selections, including union and intersect from different plots, might not be
supported natively by Bokeh. To address this deficiency, another ColumnDataSource may be
used to host the selected data points, such that it stores the past selections in memory, making it
easier to find the union or intersect. With this functionality, the user can select data of interest
from different plots or histograms in a complex manner.
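Keeping past selections in memory so that a new selection can be combined with them can be sketched with ordinary Python sets. The class `SelectionHistory` and its `mode` argument are assumptions made for this illustration; the description does not specify how the second data source is organised.

```python
class SelectionHistory:
    """Keeps the running selection so later selections can be combined with it."""

    def __init__(self):
        self.current = set()
        self.past = []  # earlier selection states, kept in memory

    def apply(self, indices, mode="replace"):
        """Combine a new selection with the running one and return the result."""
        self.past.append(set(self.current))
        new = set(indices)
        if mode == "union":
            self.current |= new
        elif mode == "intersect":
            self.current &= new
        elif mode == "replace":
            self.current = new
        else:
            raise ValueError(f"unknown mode: {mode}")
        return self.current


history = SelectionHistory()
history.apply({1, 2, 3, 4}, "replace")          # box on the PCA plot
history.apply({3, 4, 5, 6}, "union")            # irregular box on the t-SNE plot
result = history.apply({4, 5, 9}, "intersect")  # range on a histogram
```

Here the chained result is the set of points that fall in either of the first two selections and also in the third.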
[0069] The user can chain multiple union and/or intersect selections with irregular or
rectangular boxes on different scatter plots or histograms, to explore the hidden patterns that
are possibly defined by the high-order and non-trivial interactions of different input features. These
hidden patterns can be identified, for example, by pattern recognition that occurs when
information is presented in different visual forms. Manipulations of the visualizations may aid in
identifying or exploring patterns.
[0070] In contrast, users traditionally wrote SQL queries that stack filtering or aggregation on
each column to explore and verify their hypotheses. Such hypotheses are hard to know a
priori, and it is hard for such queries to take into account the high-order interactions of the
data. Moreover, it takes time and effort to write and
debug SQL programs in conventional approaches, whereas embodiments of the proposed
system are fully automatic, saving development effort and letting the user focus on the intrinsic
patterns of the data, which can potentially improve business outcomes.
[0071] Two of the dimensionality reduction methods used in this system
are fast to compute,
whereas t-SNE, a powerful process, has a hyperparameter called "perplexity".
In academia, it is
not clear how to set this parameter automatically. Accordingly, some approaches include users
running trials with different perplexities and looking at the pictures that t-SNE produces to select the best
one. This is typically very time-consuming. In the present system, an
innovative approach called
Pseudo Bayesian Information Criterion (pBIC) is implemented by the system as a
process
adapted to compute the best perplexity without human prior:
$$p^{*} = \arg\min_{p} \left[ \mathrm{kl\_div}(p) + \frac{p \, \log N}{N} \right] \qquad (1)$$
where p is the perplexity, N is the number of data points, and kl_div(p) is
the Kullback-Leibler
divergence of the t-SNE with perplexity p on the given dataset.
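The selection step can be sketched as a simple argmin over precomputed runs. The sketch assumes the criterion is the KL divergence plus a perplexity penalty of log(N) times p divided by N; the function names and the KL values below are made up for illustration and are not measured results.

```python
import math


def pbic_score(perplexity, kl_divergence, n_points):
    """Assumed form of the criterion: KL term plus a log(N)-scaled perplexity penalty."""
    return kl_divergence + math.log(n_points) * perplexity / n_points


def select_perplexity(kl_by_perplexity, n_points):
    """Return the perplexity whose t-SNE run has the lowest score."""
    return min(
        kl_by_perplexity,
        key=lambda p: pbic_score(p, kl_by_perplexity[p], n_points),
    )


# Illustrative (made-up) final KL divergences, decreasing as perplexity grows.
kl_by_perplexity = {8: 1.90, 50: 1.10, 100: 0.70, 400: 0.55, 1300: 0.40}
best = select_perplexity(kl_by_perplexity, n_points=1300)
```

With these illustrative values the smallest and largest perplexities both score worse than a mid-range value: the former through a high KL divergence, the latter through the penalty term.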
[0072] In the machine learning module of the Bokeh server 108, multiprocessing is provided to
compute t-SNE with different perplexities ranging from 8 to the number of data points N, and the
system then uses Eq. (1) to select the best one to present on the front-end. On the front-end 106
there may be a slider provided for the user to adjust the value of perplexity manually, and
the corresponding pictures will be retrieved from the backend. In practical implementations on
various datasets, the
system was able to return results that tend to have clear cluster patterns,
whereas when the
perplexity is either too large or too small, the t-SNE results become quite
blurred (e.g., difficult to
distinguish).
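Serving the slider from precomputed results can be sketched as a lookup of the nearest perplexity on a grid. The helper names, the evenly spaced grid, and the number of grid steps are assumptions for this example; the t-SNE runs themselves (and any multiprocessing used to compute them) are omitted.

```python
def perplexity_grid(n_points, n_steps, start=8):
    """Evenly spaced perplexities from `start` up to the number of data points."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    step = (n_points - start) / (n_steps - 1)
    return [start + i * step for i in range(n_steps)]


def nearest_precomputed(grid, slider_value):
    """Map a slider position to the closest perplexity that was precomputed."""
    return min(grid, key=lambda p: abs(p - slider_value))


grid = perplexity_grid(n_points=1300, n_steps=5)
chosen = nearest_precomputed(grid, slider_value=700.0)
```

The front-end would then display the stored embedding for `chosen` instead of recomputing t-SNE when the slider moves.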
[0073] In order to verify the correctness of the pBIC in Eq. (1),
Applicants developed a human-
in-the-loop system (as shown in FIG. 2 in illustration 200) that can capture
the users' preference
on the t-SNE perplexity. The approach first includes precomputing t-SNE
results with different
perplexities, and then randomly sampling two of them to present to the user at a time. The user
needs to consider and select which t-SNE pattern is better, or pass if not comparable
(too similar).
[0074] The reason that the system collects user data by asking for their preferences is that
psychology shows that, when making such preferences, behaviours tend to be less noisy than, for
example, when marking on a scale of 1-10. Once the user has made a preference at 202,
the preference is
the preference is
passed to a modified Gaussian Processes model 204 for update. Using a probit
model, the
system attempts to find the maximum preference score of perplexity from
pairwise selections at
206. The system then prepares for the next loop by sampling another pair of t-
SNE results to
present to the user to query his/her preference at 208.
[0075] Two user experiments were conducted. For each experiment, the users can see the class
labels in colour. The experiments were designed this way because, when the class labels are
available, the users tend to find a good t-SNE result easily. The first dataset is a synthetic
dataset with 1300 points coming from two Gaussian blobs in a three-dimensional space. The pBIC in Eq.
(1) returns a best
perplexity of 77.47, whereas Gaussian Processes infer from more than 100 preferences of 8
people that the optimum is 99.82. The second dataset contains 10 classes of handwritten digits with
1800 data points. The best perplexity from pBIC is 114.53, and Gaussian Processes returns an
optimum of 77.84.
[0076] Given the fact that Applicants were searching for the best
perplexity in a large interval of
[8, 1300] for the first dataset and [8, 1800] for the second, Applicants'
results indicated that pBIC
returns fairly close results to the human selections. Furthermore, the
Gaussian processes model
is in a Bayesian framework, and the optimum is the best in the mean sense. In
other words, the
inferred optimum has some uncertainty. For example, for the Gaussian blobs data with label
colouring, the inferred optimum from Gaussian Processes at p = 99.82 has a score in the 3-σ
confidence bound [0.97, 1.81], whereas at the pBIC optimum (p = 77.47) the mean
score is 1.25, which falls into this confidence bound. This is also true for the results from the second
data set, where Gaussian Processes produce a confidence bound of score in [2.21, 2.86] at p =
77.84 and the pBIC provides a perplexity of mean score 2.31 at p = 114.53. Accordingly, the
perplexity given by pBIC has a good chance to be actually optimal.
[0077] t-Distributed Stochastic Neighbor Embedding (t-SNE) (Maaten and
Hinton, 2008) is
arguably the most widely used nonlinear dimensionality reduction method for data visualization
in machine learning
and data science. Using t-SNE requires tuning some hyperparameters, notably
the perplexity.
[0078] Although according to Maaten and Hinton (2008), t-SNE results are
robust to the
settings of perplexity, in practice, users would still have to interactively
select perplexity by
visually comparing results under multiple settings. The lack of automation in
selecting this crucial
hyperparameter poses difficulty for non-expert users who do not understand the
inner workings of
the t-SNE algorithm. An approach is provided to automatically set perplexity,
which requires no
significant extra computation beyond runs of t-SNE optimization.
[0079] The proposed approach of some embodiments is based on an objective that is a function
of the perplexity and the resulting KL divergence of the learned t-SNE embedding. The objective is
motivated from the perspective of model selection and is validated by demonstrating that its
minimum agrees with human expert selection in empirical studies.
t-Distributed Stochastic Neighbor Embedding
[0080] t-SNE tries to preserve local neighborhood structure from high
dimensional space in low
dimensional space by converting pairwise distances to pairwise joint
distributions, and optimize
low dimensional embeddings to match the high and low dimensional joint
distributions.
Specifically, let $\{x_i\}_{i=1}^{n}$ be high dimensional data points, and $\{y_i\}_{i=1}^{n}$ the
corresponding low dimensional embedding points. t-SNE defines the joint distribution of a pair of
points $(i, j)$ as follows. The low dimensional joint distribution is
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}} \qquad (1)$$
and the high dimensional one is defined as symmetrized conditionals:
$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} \qquad (2)$$
where
$$p_{i|j} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_j^2\right)}{\sum_{k \neq j} \exp\left(-\|x_k - x_j\|^2 / 2\sigma_j^2\right)} \qquad (3)$$
[0081] Finally, t-SNE optimizes $\{y_i\}$ to minimize the Kullback-Leibler divergence from the low
dimensional distribution Q to the high dimensional P:
$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (4)$$
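The low dimensional joint distribution and the KL objective can be computed directly for a toy embedding. The following is a minimal pure-Python sketch for illustration only; the function names, the three embedding points, and the target distribution P are made up, and a practical implementation would use vectorised numerical code.

```python
import math


def low_dim_joint(Y):
    """Student-t joint distribution q_ij over pairs of embedding points."""
    n = len(Y)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                w[i][j] = 1.0 / (1.0 + d2)
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def kl_divergence(P, Q):
    """KL(P || Q) over pairs with i != j; zero entries of P contribute nothing."""
    n = len(P)
    return sum(
        P[i][j] * math.log(P[i][j] / Q[i][j])
        for i in range(n)
        for j in range(n)
        if i != j and P[i][j] > 0
    )


# Toy embedding: points 0 and 1 are close together, point 2 is far away.
Y = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
Q = low_dim_joint(Y)
# A made-up target P that, like Q, puts most of its mass on the pair (0, 1).
P = [[0.0, 0.4, 0.05], [0.4, 0.0, 0.05], [0.05, 0.05, 0.0]]
kl = kl_divergence(P, Q)
```

Moving the embedding points changes Q and therefore the divergence, which is the quantity the optimization drives down.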
Perplexity
[0082] Eq. 3 contains $\sigma_j$, which defines the local scale around $x_j$. The value for $\sigma_j$ is not
optimized or specified by hand individually, but rather found by bisection search to match a
pre-specified perplexity value Perp.
[0083] The perplexity of $p_j$ is
$$\mathrm{Perp}(p_j) = 2^{H(p_j)}, \quad \text{where } H(p_j) = -\sum_{i} p_{i|j} \log_2 p_{i|j},$$
and $\sigma_j$ is selected so that $\mathrm{Perp}(p_j) = \mathrm{Perp}$. Perp is a hyperparameter
of the t-SNE algorithm and is central to what structure t-SNE finds.
[0084] Larger Perp leads to larger $\sigma_j$ across the board, so that for each data point, more
neighbours have significant $p_{i|j}$.
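The bisection search for the local scale can be sketched for a single data point. The function names, search bounds, iteration count, and the toy squared distances below are assumptions for illustration; they are not taken from any particular t-SNE implementation.

```python
import math


def conditional_probs(sq_dists, sigma):
    """Conditional neighbour probabilities for one point, given squared distances."""
    nearest = min(sq_dists)  # shift for numerical stability; cancels on normalising
    weights = [math.exp(-(d2 - nearest) / (2.0 * sigma ** 2)) for d2 in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probs):
    """Perp = 2 ** H, with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy


def find_sigma(sq_dists, target_perp, lo=1e-3, hi=1e3, iters=100):
    """Bisection on sigma, relying on perplexity increasing with sigma."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if perplexity(conditional_probs(sq_dists, mid)) < target_perp:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


sq_dists = [0.5, 1.0, 2.0, 4.0, 9.0, 16.0, 25.0]  # squared distances to 7 neighbours
sigma = find_sigma(sq_dists, target_perp=4.0)
achieved = perplexity(conditional_probs(sq_dists, sigma))
```

Requesting a larger target perplexity for the same distances yields a larger scale, so more neighbours receive non-negligible probability.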
Automatic selection of perplexity
[0085] The value of Kullback-Leibler (KL) divergence from different
perplexities cannot be
compared to assess the quality of embeddings, since the final KL divergence
typically decreases
as perplexity increases, as illustrated in plot 300A of FIG. 3A, so that model
selection based on
KL divergence alone will always lead to very large Perp.
[0086] However, the resulting embeddings from large Perp converge to a Gaussian-like blob
and do not capture the underlying pattern of the data. This suggests that trading off between the final
KL divergence and Perp could potentially lead to good embeddings. Based on this intuition, the
system applies the following criterion:
$$S(\mathrm{Perp}) = \mathrm{KL}(P \,\|\, Q) + \log(n)\,\frac{\mathrm{Perp}}{n} \qquad (5)$$
[0087] Corresponding to KL in FIG. 3A, S as a function of Perp is shown in plot 300B of
FIG. 3B.
[0088] In later sections, examples will be provided demonstrating that
Perp that minimizes S
agrees with selection by human users across a number of datasets. Eq. 5 is
motivated by relating
the equation to the Bayesian Information Criterion (BIC), and to minimizing
description length.
Interpretation as reverse complexity tuning via pseudo BIC
[0089] Eq. 5 bears resemblance to the Bayesian Information Criterion (BIC):
$$\mathrm{BIC} = -2 \log(\hat{L}) + \log(n)\,k \qquad (6)$$
where the first term $-2 \log(\hat{L})$ is the goodness-of-fit of the maximum-likelihood-estimated
model ($\hat{L}$), while the second term $\log(n)\,k$ controls the complexity of the model by
penalizing the number of free parameters k scaled by log(n). BIC is a large sample approximation
to the negative marginal log likelihood of the model, and minimizing BIC automatically balances
data-fit and model complexity.
[0090] The two terms in Eq. 5 are analogous, but the way the complexity
changes is reversed:
instead of increasing complexity of model to fit data better, increasing Perp
reduces complexity of
the pattern in data to be modelled, so that the same lower dimensional space
can embed them
better.
[0091] This is because when projecting from high dimensional to low
dimensional spaces,
there is not enough "room" in lower dimensional space to preserve all
structure in high dimension,
i.e., the "crowding problem". As Perp increases, differences of distances
among points will
become less and less significant with respect to the length scales of the
kernel in P distribution,
and P will tend toward a uniform distribution.
[0092] The forward form of KL objective function in Eq. 4 has large cost
for under-estimating
probability at some point, but not for over-estimating. In other words, if $p_{ij}$ is large and $q_{ij}$ is very
small, the KL divergence from that term is large, but in the opposite direction of small $p_{ij}$ and
large $q_{ij}$, KL is not as affected. Increasing Perp leads to larger $\sigma_j$ and more uniform $p_{ij}$,
so it is easier for the Student-t distribution in low dimensional space to assign sufficient
probability mass for all points.
[0093] In short, increasing Perp relaxes the problem by reducing the amount of structure to be
modelled so that less error is made as measured by $\mathrm{KL}(P \,\|\, Q)$, but one pays a
price in the second
term of Eq. 5. The end result is the same, a balance between data-fit and
complexity of model
relative to data complexity is achieved. For this reason, this description
refers to S(Perp) in Eq. 5
as pseudo BIC in the experiments.
Minimizing Description Length
[0094] Minimum description length (Rissanen, 1978) is a way to realize the Occam's razor
principle for model selection. It recognizes that a model that captures any regularity in data can
compress the data accordingly; hence the reduced description length of the data is the description
length of the model plus the description length of the data compressed under the model.
[0095] The $\mathrm{KL}(P \,\|\, Q)$ in Eq. 4 is the average number of extra bits required to encode
samples from P using a code optimized for Q. Since $p_{ii}$ is assumed to be 0 in t-SNE, $M = (n^2 - n)/2$ is the
number of unique pairwise probabilities. So $M \cdot \mathrm{KL}(P \,\|\, Q)$ is the total number of extra bits required.
On the other hand, it takes $-\log(1/n)$ bits to encode the identity (index) of one data point, and each
data point has Perp number of neighbors on average.
[0096] Because of the symmetrization of pairwise joint probability in t-SNE, there are
$\frac{n}{2} \log(n)\,\mathrm{Perp}$ bits required to encode all neighbor identity information. Taking out the
factor of M, Eq. 5 is arrived at.
Validation With Actual Human Prior On Perplexity
[0097] In order to validate the correctness of the proposed Pseudo BIC, a system is developed
in some examples to capture the human prior on t-SNE pictures resulting from different perplexities.
Given a dataset, this system shows a pair of t-SNE pictures at a time, asks the user for a
preference (the user can manipulate the pictures, in some embodiments), and continues for many
iterations. Once the user preferences are collected, the approach applies Gaussian Processes
(GP) with a probit model to infer the preferred perplexity and compares it with the Pseudo BIC result.
Experiments have shown that Pseudo BIC consistently and automatically produces perplexities that
are actually preferred by the users.
Extraction of Human Prior Using Gaussian Processes with Probit Model
[0098] A naive way to extract the human prior on the t-SNE visualizations from different
perplexities is to iteratively create an instance with one perplexity and ask a human to rate it. This
strategy suffers from the problem that humans are not good at rating in a consistent manner,
whereas judgements on preferences are more accurate. In some examples, the system employs
a preference learning method using GP with a probit model, which learns the latent function from
pairwise preferences (Eric et al., 2008; Brochu et al., 2010). The latent function here maps
perplexity to a human preference score, and its maximum represents the most preferred
perplexity from the human prior, which it is desirable to compare with the result from Pseudo BIC.
[0099] A system is built to collect human preferences on t-SNE pictures. The t-SNE results are
precomputed from a linear grid of perplexities from 8 to the number of data samples, from which
at each iteration two results are randomly sampled with replacement and presented.
[00100] FIG. 4 is an example set of graphical user interfaces 400 that may be
generated and
shown to a user, according to some embodiments. As shown in FIG. 4, at some
iteration, two t-
SNE pictures are presented to the user, who is asked to select which one has a
better pattern,
namely, which one has clearly distinct clusters (shown, for example, with buttons LEFT, RIGHT, and
PASS, which are interactive interface elements). If they are not comparable, the user just needs to
pass the current query (e.g., by selecting PASS).
[00101] Once the user preferences are collected, the system may be configured
to use the GP
with probit model to infer the most preferred perplexity. This model was originally developed for
Bayesian Optimization. In theory, Bayesian Optimization can optimize a black-box function with a
minimum number of queries by trading off between exploration and exploitation using an
acquisition function. The reason Applicants are not, in some examples, adopting such an active
learning strategy is to reduce the waiting time of the users. The t-SNE method tends to be
computationally expensive, and embedding it on-the-fly in a human-in-the-loop system turns out
to cause the users to be bored and distracted. In contrast, having the t-SNE pictures
precomputed and collecting user results by randomly selecting pairs to compare in a fluid manner
can keep the user focused, and thus more accurate results can be collected. The random
selection strategy can also be interpreted as a Bayesian Optimization procedure that randomly
explores the problem space.
User Experiments
[00102] When designing the experiments for the users, a first question to ask is whether a user can
figure out the pattern from a t-SNE picture given a specific dataset. The answer is that, if there is no
obvious intrinsic (local) structure within the data in the high-dimensional space, the user cannot
make a judgement on which picture to choose.
[00103] For example, a dataset generated from one multivariate Gaussian does
not have any
useful local information, whereas two Gaussians will result in two distinct clusters in two- or
three-dimensional embeddings from t-SNE. Therefore, in some embodiments, the system is configured to
select two datasets with clear intrinsic structures, and show in the interface the colored class
labels (FIG. 5), to make it easier for the user to see how well the algorithm works to retain the
local structure in the original high-dimensional space. In FIG. 5, a Gaussian Processes posterior
500 is shown from preference learning, the x-axis showing perplexity, and the y-axis representing the
perplexity preference score. A line is shown with the mean, and the 3-σ confidence bounds are
shown. The star 502 shows the perplexity obtained from Pseudo BIC.
[00104] For the synthetic Gaussian blobs dataset, there are 1100 points generated from a
3-dimensional Gaussian mixture of two uneven Gaussians. Ideally, t-SNE will result in a pattern with
two distinct clusters. In this case, Pseudo BIC computes an optimal perplexity of 77.47. The
system also collects 115 preferences from 4 users, and the Gaussian Processes model
generates a posterior in the diagram 600 of FIG. 6. The Pseudo BIC result is marked by a star
602, and it can be seen that the two approaches provide similar results. FIG. 6 shows a Gaussian
Processes posterior 600 from preference learning, the x-axis showing perplexity, and the y-axis
representing the perplexity preference score. A line is shown with the mean, and the 3-σ
confidence bounds are shown. The star 602 shows the perplexity obtained from Pseudo BIC.
[00105] GP has a maximum posterior mean at 99.82, which is slightly off from the Pseudo BIC
result. However, there is some uncertainty about the maximum posterior mean.
In fact, when the perplexity is 77.47, the Gaussian Processes model has mean 1.25, which falls
into the confidence bound at perplexity 99.82.
[00106] For the Digits data, there are 1797 data points with 10 distinct
classes. 138 preferences
are collected from 4 users, from which GP produces a posterior as in FIG. 6. A perplexity of
114.53 is reported from Pseudo BIC, whereas GP returns a maximum mean at p = 80.11.
Nevertheless, the mean prediction at p = 114.94 still falls into the 3-σ confidence bound of GP at
p = 80.11.
[00107] In summary, the demonstrated results indicate that Pseudo BIC returns a perplexity that is
very close to the one preferred by the human prior.
[00108] FIG. 7 is an example system architecture diagram of a visualization
platform 200
according to some embodiments. The visualization platform 200 can implement
aspects of the
processes described herein.
[00109] The visualization platform 200 connects to interface application 740,
entities 760, and
data sources 780 (with databases 790) using network 730. Entities 760 can
interact with the
platform 200 to provide input data (e.g. user files) and receive output data.
Network 730 (or
multiple networks) is capable of carrying data and can involve wired
connections, wireless
connections, or a combination thereof. Network 730 may involve different
network communication
technologies, standards and protocols, for example. The interface application
740 can be
installed on a computing device to display an interface of visual elements
that can represent
dynamic visualizations that update in response to control commands at
interface and interactive
plots. The visual elements represent transformed raw data (e.g. user files)
that can be generated
using data models 716, flask unit 712, plot generator 710 and interface
generators 714.
[00110] The visualization platform 200 can include an I/O Unit 702, a
processor 704,
communication interface 706, and data storage 702. The processor 704 can
execute instructions
in memory 708 to implement aspects of processes described herein. The
processor 704 can
execute instructions in memory 708 to configure data models 716, flask unit
712, plot generator
710 and interface generators 714, and other functions described herein. The
visualization
platform 200 has a processor 704 configured to collect data from different
data sources 780 in a
network 730. On the backend, there may be multiple processors 704 running
simultaneously to
implement the processes described.
[00111] The visualization platform 200 can generate one or more visualizations
indicative of
chaining of union or intersect of selections. The visualization platform 200
has a processor 704
configured to process machine readable instructions to receive user files from
entities 760 and/or
data sources 780 (coupled to databases 790) for storage in data storage 702.
The visualization
platform 200 can process the user files by applying a hyperparameter selection
using data model
716. In some embodiments, the processor 704 is configured to process the user
files using a
pseudo Bayesian Information criterion of the data model 716 for automatic
application of the
hyperparameter selection. In some embodiments, the pseudo Bayesian information
criterion is
applied to automatically generate a best perplexity.
[00112] In some embodiments, the processor 704 is configured with the Flask
unit 712 to
preprocess the user files to correct missing values, set appropriate types,
and compute
descriptive data. In some embodiments, the processor 704 is configured with
the Flask unit 712
to process the user files by applying the automatic hyperparameter selection
to reduce the
dimensionality of the user files for generation of the interactive plot. That
is, the user files are
reduced in dimension for generation of the interactive plots by plot generator
710. The user files
can be high dimensional data and the interactive plots can be two dimensional
data or three
dimensional data. The processor 704 is configured to process the user
files by applying the
automatic hyperparameter selection to reduce the dimensionality of the user
files from the high
dimensional data to the two dimensional data or the three dimensional data. In
some
embodiments, the processor 704 is configured to process the user files to
reduce the
dimensionality of the user files for generation of the interactive plots using
dimensionality
reduction processes PCA, ICA and t-SNE, the interactive plots comprising
reduction results from
the dimensionality reduction processes.
[00113] The visualization platform 200 uses processor 704 configured with plot
generator 710 to
generate interactive plots using the processed user files and stores the
interactive plots in
databases 718. The interactive plots can be indicative of chaining of union or
intersect of
selections. In some embodiments, the plot generator 710 generates interactive
plots that have a
first scatter plot for the dimensionality reduction process PCA, a second
scatter plot for the
dimensionality reduction process ICA, and a third scatter plot for the
dimensionality reduction
process t-SNE, and a plurality of histograms showing distributions for the
dimensionality reduction
processes.
[00114] The visualization platform 200 uses processor 704 configured with
interface generator
714 to generate an interface with visual elements indicating the interactive
plots. The interface
has selectable indicia configured to be responsive to input to dynamically
update the interactive
plots. The input can be a selection of a data point or a subset of data
points. In some
embodiments, the data point represents an outlier data point or the subset of
data points
represents a cluster. The input can be a manipulation of the interactive
plots. The input can be a
movement of a slider that represents perplexity. In some embodiments, the
interface has the
selectable indicia configured to be responsive to input to trigger an
operation for chaining of union
or intersect selections of the selected data point or the subset of data
points. In some
embodiments, the selected data point or the subset of data points is from a
first interactive plot
which triggers generation of an automatic update of visual elements for other
interactive plots at
the interface. In some embodiments, the selectable indicia are logical anchor
points of the visual
elements that are indicative of an interactive ability to control
visualization and the interface.
Other examples are described herein. In some embodiments, the processor 704 is
configured to
store received input in a data storage as past selections for use in
generating a union or intersect.
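The chaining of union or intersect selections described in paragraph [00114] can be sketched with ordinary set operations. This is a minimal sketch, not the implementation used by the platform: the class name `SelectionHistory`, the `mode` argument, and the use of integer point identifiers are hypothetical.

```python
class SelectionHistory:
    """Stores past selections and chains them with union or intersect."""

    def __init__(self):
        self.past = []        # every selection received, in order
        self.current = set()  # result of the chain so far

    def select(self, point_ids, mode="replace"):
        """Record a new selection and combine it with the current one."""
        ids = set(point_ids)
        self.past.append(ids)
        if mode == "replace":
            self.current = ids
        elif mode == "union":
            self.current = self.current | ids
        elif mode == "intersect":
            self.current = self.current & ids
        else:
            raise ValueError(f"unknown mode: {mode}")
        return self.current
```

For example, selecting a cluster in one scatter plot, extending it with a union from a second plot, and then narrowing it with an intersect yields the set of points that would be highlighted across the other linked plots.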
[00115] Responsive to the selectable indicia, the visualization platform 200
uses processor 704
configured with plot generator 710 to generate updated interactive plots and
interface generator
714 to generate additional visual elements indicating the updated interactive
plots. The
visualization platform 200 uses processor 704 configured with interface
generator 714 to update
the interface with the additional visual elements indicating updated
interactive plots. An interface
application 740 (e.g. a user interface component of a computing device) is
configured to display
the interface with the visual elements representing the interactive plots and
the additional visual
elements indicating the updated interactive plots.
[00116] In some embodiments, the processor 704 is configured to process the
user files using a
pseudo Bayesian Information criterion of the data models 716 for the automatic
hyperparameter
selection. In some embodiments, the pseudo Bayesian information criterion is
applied to
automatically generate a best perplexity. In some embodiments, the pseudo
Bayesian Information
criterion is computed using p for the perplexity, N as a number of data points
of the user files, and
kl_div(p) is a Kullback-Leibler divergence of t-SNE with perplexity p on the
user files. In some
embodiments, the processor 704 is configured to implement machine learning
(using rules of data
model 716) to compute t-SNE with different perplexities to select the best
perplexity. In some
embodiments, the processor 704 is configured to implement an unsupervised
learning process for
the automatic hyperparameter selection.
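Paragraph [00116] names the inputs to the pseudo Bayesian Information criterion (p, N and kl_div(p)) without restating the formula. The sketch below assumes the form pBIC(p) = 2 * kl_div(p) + log(N) * p / N, with the best perplexity being the candidate that minimizes it. That form, the function names, and the example values in the usage note are assumptions for illustration. The Kullback-Leibler divergences are taken as already computed by running t-SNE once per candidate perplexity.

```python
import math


def pseudo_bic(kl_div, perplexity, n_points):
    """Assumed form: 2 * KL + log(N) * p / N (lower is better)."""
    return 2.0 * kl_div + math.log(n_points) * perplexity / n_points


def best_perplexity(kl_by_perplexity, n_points):
    """Return the candidate perplexity with the lowest pseudo-BIC score.

    kl_by_perplexity maps each candidate perplexity to the final KL
    divergence reported by a t-SNE run at that perplexity.
    """
    return min(
        kl_by_perplexity,
        key=lambda p: pseudo_bic(kl_by_perplexity[p], p, n_points),
    )
```

With hypothetical divergences `{5: 1.50, 30: 1.10, 100: 0.95, 300: 0.90}` for N = 1000, `best_perplexity` returns 30: larger perplexities lower the divergence slightly, but the log(N) * p / N term penalizes them more. In scikit-learn, for instance, a fitted `TSNE` estimator exposes the final divergence as its `kl_divergence_` attribute, which could supply the values used here.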
[00117] In some embodiments, the interface generator 714 generates the
selectable indicia with
a slider to select a value for a perplexity for the pseudo Bayesian
Information criterion to update
the interactive plots.
[00118] In some embodiments, the plot generator 710 can generate the
interactive plots as
scatter plots linked to histograms of an original dimension of the user files
to show a comparison
between distributions of selected data point or the subset of data points.
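One way to picture the linked histograms of paragraph [00118] is to bin a single original dimension twice, once for all data points and once for the selected subset, using the same bin edges so the two distributions can be compared directly. The sketch below is illustrative only; the function names, the equal-width binning, and the default of four bins are assumptions.

```python
def bin_counts(values, lo, hi, n_bins):
    """Count values into n_bins equal-width bins spanning [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Guard against a zero-width range; the top edge falls in the last bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def compare_distributions(column, selected_ids, n_bins=4):
    """Histogram one original dimension for all points and for a selection.

    Both histograms share the same bin range so they can be overlaid.
    """
    lo, hi = min(column), max(column)
    selected = [column[i] for i in sorted(selected_ids)]
    return {
        "all": bin_counts(column, lo, hi, n_bins),
        "selected": bin_counts(selected, lo, hi, n_bins),
    }
```

Here `selected_ids` holds the row indices of the points chosen in a scatter plot, and `column` holds the values of one original dimension of the user files.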
[00119] The I/O unit 702 can enable the platform 200 to interconnect with one
or more input
devices, such as a keyboard, mouse, camera, touch screen and a microphone,
and/or with one or
more output devices such as a display screen and a speaker.
[00120] The processor 704 can be, for example, any type of general-purpose
microprocessor or
microcontroller, a digital signal processing (DSP) processor, an integrated
circuit, a field
programmable gate array (FPGA), a reconfigurable processor, or any combination
thereof.
[00121] Memory 708 may include a suitable combination of any type of computer
memory that
is located either internally or externally such as, for example, random-access
memory (RAM),
read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical
memory,
magneto-optical memory, erasable programmable read-only memory (EPROM), and
electrically-
erasable programmable read-only memory (EEPROM), Ferroelectric RAM (FRAM) or
the like.
Data storage devices 702 can include memory 708, databases 718 (e.g. graph
database), and
persistent storage 720.
[00122] The communication interface 706 can enable the platform 200 to
communicate with
other components, to exchange data with other components, to access and
connect to network
resources, to serve applications, and perform other computing applications by
connecting to a
network (or multiple networks) capable of carrying data including the
Internet, Ethernet, plain old
telephone service (POTS) line, public switched telephone network (PSTN),
integrated services
digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber
optics, satellite, mobile,
wireless (e.g. Wi-Fi, WiMAX), SS7 signaling network, fixed line, local area
network, wide area
network, and others, including any combination of these.
[00123] The visualization platform 200 can be operable to register and
authenticate users (using
a login, unique identifier, and password for example) prior to providing
access to applications, a
local network, network resources, other networks and network security devices.
The visualization
platform 200 can connect to different machines or entities 760.
[00124] The data storage 702 may be configured to store information associated
with or created
by the platform 200. Storage 702 and/or persistent storage 720 may be provided
using various
types of storage technologies, such as solid state drives, hard disk drives,
and flash memory, and the information
may be stored in various formats, such as relational databases, non-relational
databases, flat
files, spreadsheets, extended markup files, and so on.
[00125] FIG. 8 is a schematic diagram of computing device 800 which can
implement aspects of
different processes described herein. As depicted, computing device 800 includes
at least one
processor 802, memory 804, at least one I/O interface 806, and at least one
network interface
808.
[00126] Each processor 802 may be, for example, microprocessors or
microcontrollers, a digital
signal processing (DSP) processor, an integrated circuit, a field programmable
gate array
(FPGA), a reconfigurable processor, a programmable read-only memory (PROM), or
combinations thereof. Processors are used to implement the various logical
and computing
units of a system, as shown in FIGs. 1 and 7, for example, and different units
may have different
processors, or may be implemented using the same set of processors or the same
processor.
[00127] Memory 804 may include a suitable combination of computer memory that
is located
either internally or externally such as, for example, random-access memory
(RAM), read-only
memory (ROM), compact disc read-only memory (CDROM), electro-optical memory,
magneto-
optical memory, erasable programmable read-only memory (EPROM), and
electrically-erasable
programmable read-only memory (EEPROM), Ferroelectric RAM (FRAM). Memory 804
may be
used to store visualizations, insights, data relationships, etc.
[00128] Each I/O interface 806 enables computing device 800 to interconnect
with one or more
input devices, such as a keyboard, mouse, camera, touch screen and a
microphone, or with one
or more output devices such as a display screen and a speaker. I/O interfaces
806 can include
command line interfaces. These I/O interfaces 806 can be utilized to interact
with the system, for
example, to provide data inputs, preferences, etc.
[00129] Each network interface 808 enables computing device 800 to communicate
with other
components, to exchange data with other components, to access and connect to
network
resources, to serve applications, and perform other computing applications by
connecting to a
network (or multiple networks) capable of carrying data including the
Internet, Ethernet, plain old
telephone service (POTS) line, public switched telephone network (PSTN),
integrated services
digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber
optics, satellite, mobile,
wireless (e.g. Wi-Fi, WiMAX), SS7 signaling network, fixed line, local area
network, wide area
network, and others, including combinations of these. Network interfaces 808
are utilized, for
example, to receive inputs, transmit or transform visualizations for remote
devices, etc.
[00130] Embodiments of methods, systems, and apparatus are described through
reference to
the drawings.
[00131] Throughout the foregoing discussion, numerous references have been made
regarding
servers, services, interfaces, portals, platforms, or other systems formed
from computing devices.
It should be appreciated that the use of such terms is deemed to represent one
or more
computing devices having at least one processor configured to execute software
instructions
stored on a computer readable tangible, non-transitory medium. For example, a
server can
include one or more computers operating as a web server, database server, or
other type of
computer server in a manner to fulfill described roles, responsibilities, or
functions.
[00132] The technical solution of embodiments may be in the form of a software
product. The
software product may be stored in a non-volatile or non-transitory storage
medium, which can be
a compact disk read-only memory (CD-ROM), a USB flash disk, or a removable
hard disk. The
software product includes a number of instructions that enable a computer
device (personal
computer, server, or network device) to execute the methods provided by the
embodiments.
[00133] The embodiments described herein are implemented by physical computer
hardware,
including computing devices, servers, receivers, transmitters, processors,
memory, displays, and
networks. The embodiments described herein provide useful physical machines
and particularly
configured computer hardware arrangements.
[00134] Although the embodiments have been described in detail, it should be
understood that
various changes, substitutions and alterations can be made herein.
[00135] Moreover, the scope of the present application is not intended to be
limited to the
particular embodiments of the process, machine, manufacture, composition of
matter, means,
methods and steps described in the specification.
[00136] As can be understood, the examples described above and illustrated are
intended to be
exemplary only.

Administrative Status

Title	Date
Forecasted Issue Date	Unavailable
(86) PCT Filing Date	2018-05-08
(87) PCT Publication Date	2018-12-20
(85) National Entry	2019-12-02
Examination Requested	2022-09-14

Abandonment History

There is no abandonment history.

Maintenance Fee

Last Payment of $277.00 was received on 2024-04-08

Upcoming maintenance fee amounts

Description	Date	Amount
Next Payment if standard fee	2025-05-08	$277.00
Next Payment if small entity fee	2025-05-08	$100.00

Note : If the full payment has not been received on or before the date indicated, a further fee may be required which may be one of the following

the reinstatement fee;
the late payment fee; or
additional fee to reverse deemed expiry.

Patent fees are adjusted on the 1st of January every year. The amounts above are the current amounts if received by December 31 of the current year.
Please refer to the CIPO Patent Fees web page to see all current fee amounts.

Payment History

Fee Type	Anniversary Year	Due Date	Amount Paid	Paid Date
Registration of a document - section 124		2019-12-02	$100.00	2019-12-02
Application Fee		2019-12-02	$400.00	2019-12-02
Maintenance Fee - Application - New Act	2	2020-05-08	$100.00	2019-12-02
Maintenance Fee - Application - New Act	3	2021-05-10	$100.00	2021-05-03
Maintenance Fee - Application - New Act	4	2022-05-09	$100.00	2022-04-12
Request for Examination		2023-05-08	$203.59	2022-09-14
Maintenance Fee - Application - New Act	5	2023-05-08	$210.51	2023-04-11
Maintenance Fee - Application - New Act	6	2024-05-08	$277.00	2024-04-08

Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
ROYAL BANK OF CANADA

Past Owners on Record
None

Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.

Documents

Document Description	Date (yyyy-mm-dd)	Number of pages	Size of Image (KB)
Abstract	2019-12-02	1	58
Claims	2019-12-02	5	171
Drawings	2019-12-02	8	246
Description	2019-12-02	26	1,345
Representative Drawing	2019-12-02	1	6
Patent Cooperation Treaty (PCT)	2019-12-02	1	41
International Search Report	2019-12-02	3	106
National Entry Request	2019-12-02	11	330
Cover Page	2020-01-07	1	31
Request for Examination	2022-09-14	4	105
Examiner Requisition	2023-12-21	6	321
Amendment	2024-04-19	67	3,609
Claims	2024-04-19	6	311
Description	2024-04-19	23	1,887
