Patent 2986320 Summary

(12) Patent Application: (11) CA 2986320
(54) English Title: METHODS AND SYSTEMS FOR CONTEXT-SPECIFIC DATA SET DERIVATION FROM UNSTRUCTURED DATA IN DATA STORAGE DEVICES
(54) French Title: METHODES ET SYSTEMES DE DERIVATION D'ENSEMBLE DE DONNEES CONTEXTUELLES A PARTIR DE DONNEES NON STRUCTUREES DANS LES DISPOSITIFS DE STOCKAGE DE DONNEES
Status: Deemed Abandoned
Bibliographic Data
(51) International Patent Classification (IPC):
  • G06F 16/383 (2019.01)
(72) Inventors :
  • WEEKS, RUSS (Canada)
  • GEORGIOU, TRISTEN (Canada)
  • TO, TIM (Canada)
  • ROEHRL, JOSEF (Canada)
(73) Owners :
  • FUSEFORWARD TECHNOLOGY SOLUTIONS LIMITED
(71) Applicants :
  • FUSEFORWARD TECHNOLOGY SOLUTIONS LIMITED (Canada)
(74) Agent:
(74) Associate agent:
(45) Issued:
(22) Filed Date: 2017-11-21
(41) Open to Public Inspection: 2019-05-21
Examination requested: 2022-09-27
Availability of licence: N/A
Dedicated to the Public: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): No

(30) Application Priority Data: None

Abstracts

English Abstract


Described are various embodiments of systems, methods, and devices relating to the generation of independent context-specific datasets based on existing raw data sets, some embodiments comprising a plurality of data storage components for storage of a plurality of data objects; and a processing component having a data object key value store accessible thereto, said data object key value store configured to store a unique key-value logical row for constituent data object components of a data object, each such key-value logical row comprising a key for uniquely identifying the key-value logical row; a constituent data object component value for providing value information relating to the constituent data object component; and a metadata descriptor for describing a data object component characteristic of the constituent data object component value; wherein at least one of the constituent data object components are derived from raw data and at least one of the constituent data object components are derived from one or more other constituent data object components; and wherein, in response to a data access request based on one or more metadata descriptors.


Claims

Note: Claims are shown in the official language in which they were submitted.


CLAIMS
What is claimed is:
1. A data storage system for fulfilling a data request for a context-specific
data set,
said context-specific data set based on a raw data set, the system comprising:
a plurality of network-accessible hardware storage resources, each of said
hardware storage resources being in network communication and configured for
distributed storage of data objects;
a digital data processor responding to data access requests received over a
network and relating to the data objects;
a key-value store comprising a unique key-value logical row for each
constituent data component of each of said data objects, each said unique key-
value logical row comprising:
a key for identifying said unique key-value logical row;
a constituent data component value comprising stored digital
information relating to said constituent data component associated with
said unique key-value logical row; and
a metadata descriptor describing metadata of said constituent data
component value;
wherein at least one key-value logical row for a given data object is a
direct key-value logical row directly associated with the raw data set and
wherein
at least one key-value logical row for the given data object is a derived key-
value
logical row derived from one or more other key-value logical rows;
wherein, upon said digital data processor generating the context-specific
data set responsive to a given data request to the data storage system, said
digital
data processor further generates a re-identification risk value for the
context-
specific data set to be associated therewith, said re-identification risk
value
representative of a likelihood that a given constituent data component used to
generate the context-specific data set can be directly associated with an
identifiable subject to which said given constituent data component pertains;
and
wherein said given data request is selectively fulfilled by the data storage
system as a function of said re-identification risk value.
2. The data storage system of claim 1, wherein said re-identification risk
value is
generated based on similarities between an aspect of the given constituent
data
component, and a corresponding aspect of at least one other constituent data
component used in the context-specific data set.
3. The data storage system of claim 1, wherein the re-identification risk
is generated
based on at least one of the following calculated properties of the aspect of
the
context-specific data set: k-anonymity, t-closeness, l-diversity, and privacy
differential.
4. The data storage system of any one of claims 1 to 3, wherein each key-
value
logical row further comprises a sensitivity value indicating a sensitivity
associated
with a corresponding key-value logical row.
5. The data storage system of claim 4, wherein the sensitivity value is
associated
with one or more of the following: a permissible requesting user identifier
for the
corresponding key-value logical row or an aspect thereof, a predetermined
sensitivity tag associated with one or more aspects of the corresponding key-
value
logical row, the raw data sets from which the corresponding key-value logical
row
originated, or the metadata descriptor of the corresponding key-value logical
row.
6. The data storage system of any one of claims 1 to 5, wherein said given
data
request for the context-specific data set so generated in response thereto is
selectively fulfilled solely upon said re-identification risk value associated
with
the context-specific data set being lower than a designated re-identification
risk
threshold.
7. The data storage system of claim 6, wherein said re-identification risk
threshold is
automatically determined by the data storage system based on whether a
requesting computing device is within a designated zone of trust.
8. The data storage system of claim 6, wherein said re-identification risk
threshold is
automatically determined by the data storage system based on one or more of:
an
identity of a requesting user, a role of the requesting user, sensitivity of
data
components in the context-specific data set, a location of a requesting
computing
device, a security indication of the requesting computing device, or a
combination
thereof.
9. The data storage system of any one of claims 1 to 8, wherein at least
one said
derived key-value logical row is automatically generated from the raw data set
upon importing such raw data set into the data storage system.
10. The data storage system of any one of claims 1 to 8, wherein at least
one said
derived key-value logical row is derived upon request for such derivation by a
user of the system.
11. The data storage system of any one of claims 1 to 8, wherein at least
one said
derived key-value logical row is automatically derived from one or more pre-
existing direct or derived key-value logical rows that are associated with
said
given data request so to automatically increase a similarity between
corresponding
aspects of constituent data components derived from said one or more existing
key-value logical rows and thus reduce a given re-identification risk value
associated with a derived context-specific data set relying on said at least
one
derived key-value logical row given said similarity increase.
12. The data storage system of claim 11, wherein said derived key-value
logical row
is derived by obfuscating the constituent data component value of said pre-
existing key-value logical rows to generate the constituent data component
value

of the derived key-value logical row, and the corresponding metadata
descriptor
of the derived key-value logical row being generated based on said
obfuscating.
13. The data storage system of claim 11 or 12, wherein said derived context-
specific
data set is automatically generated by the data storage system upon said re-
identification risk value associated with a first context-specific data set
being too
high to permit selective fulfilment of said given data request.
14. The data storage system of any one of claims 11 to 13, wherein said
derived
context-specific data set is generated automatically upon said re-
identification risk
value associated with a first context-specific data set being higher than a
first
designated threshold.
15. A data storage method for fulfilling a data request for a context-
specific dataset
based on one or more raw data sets, the method implemented on a data storage
system comprising a plurality of network-accessible hardware storage
resources,
each of said hardware storage resources being in network communication and
configured for distributed storage of data objects, and a digital data
processor for
responding to data storage requests received over a network and relating to
said
data objects, the method comprising:
storing a key-value store comprising a unique key-value logical row for
each constituent data component of each data object, each key-value logical
row
comprising:
a key for identifying the key-value logical row;
a constituent data component value comprising stored digital
information relating to the constituent data component
associated with the key-value logical row; and
a metadata descriptor describing metadata of a data component
value;
directly generating at least one of the key-value logical rows for a given
data object from raw data;
deriving at least one of the key-value logical rows for the given data object
from other key-value logical rows;
generating the context-specific data set responsive to the data request;
generating a re-identification risk value for the context-specific data set,
the re-identification risk value indicating a likelihood that a given
constituent data
component used to generate the context-specific data set can be directly
associated with an identifiable subject to which said given constituent data
component pertains; and
selectively fulfilling the context-specific data request as a function of said
re-identification risk value.
16. The method of claim 15, wherein said re-identification risk value is
generated
based on similarities between an aspect of the given constituent data
component,
and a corresponding aspect of at least one other constituent data component
used
in the context-specific data set.
17. The method of claim 16, wherein said re-identification risk value is
generated
based on at least one of the following calculated properties of the aspect of
the
context-specific data set: k-anonymity, t-closeness, l-diversity, or privacy
differential.
18. The method of any one of claims 15 to 17, wherein each key-value
logical row
further comprises a sensitivity value indicating a sensitivity associated with
the
corresponding key-value logical row.
19. The method of claim 18, wherein the sensitivity value is associated
with one or
more of the following: a permissible requesting user identifier for the
corresponding key-value logical row or an aspect thereof, a predetermined
sensitivity tag associated with one or more aspects of the corresponding key-
value
logical row, the raw data sets from which the corresponding key-value logical
row
originated, or the metadata descriptor of the corresponding key-value logical
row.
20. The method of any one of claims 15 to 19, wherein said selectively fulfilling
comprises
fulfilling the data request solely upon said re-identification risk value
associated
with the context-specific data set being lower than a designated risk
threshold.
21. The method of claim 20, wherein said risk threshold is determined based
on
whether a requesting computing device is within a designated zone of trust.
22. The method of claim 20, wherein said risk threshold is determined based
on one
or more of: an identity of a requesting user, a role of the requesting user,
sensitivity of data components in the context-specific data set, a location of
a
requesting computing device, a security indication of the requesting computing
device, or a combination thereof.
23. The method of any one of claims 15 to 22, further comprising:
automatically generating a derived context-specific data set to fulfil the
data request, wherein the derived context-specific data set is based on at
least one
derived key-value logical row that is automatically derived from one or more
pre-
existing direct or derived key-value logical rows associated with the data
request
so to automatically increase a similarity between corresponding aspects of
constituent data components derived from said one or more pre-existing key-
value
logical rows and thus reduce a given re-identification risk value associated
with
said derived context-specific data set given said similarity increase.
24. The method of claim 23, wherein the at least one derived key-value
logical row is
derived by obfuscating the constituent data component value of the
corresponding
one or more pre-existing key-value logical rows to generate the constituent
data
component value of the derived key-value logical row, and the corresponding
metadata descriptor of the derived key-value logical row being generated based
on
said obfuscating.
25. The method of claim 23 or 24, wherein the derived context-specific data
set is
generated upon the re-identification risk associated with a first context-
specific
data set being too high to permit selective fulfilment of the data request.
26. The method of any one of claims 20 to 22, wherein the derived context-
specific
data set is generated automatically upon the re-identification risk value
associated
with a first context-specific data set being higher than a first designated
risk
threshold.
27. A device for fulfilling a data request for a context-specific dataset
based on an
existing raw data set, the device being in network communication with a
plurality
of network-accessible hardware storage resources, each of said hardware
storage
resources being in network communication, and configured for distributed
storage
of data objects, the device comprising:
a digital data processor for responding to data storage requests received
over a network and relating to said data objects; and
a network communications interface for communicatively interfacing one
or more requesting users and a key-value store configured to store a unique
key-
value logical row for each constituent data object component of each data
object,
each such key-value logical row comprising:
a key for identifying the key-value logical row;
a constituent data component value comprising stored digital
information relating to the constituent data component associated with the
key-value logical row; and
a metadata descriptor describing metadata of the constituent data
component value;
wherein at least one key-value logical row for a given data object is a
direct key-value logical row directly associated with raw data and wherein at
least
one key-value logical row for the given data object is a derived key-value
logical
row derived from one or more other key-value logical rows; and
wherein, upon said digital data processor generating the context-specific
data set responsive to a given data request to the data storage system, said
digital
data processor further generates a re-identification risk value for the
context-
specific data set to be associated therewith, said re-identification risk
value
representative of a likelihood that a given constituent data component used to
generate the context-specific data set can be directly associated with an
identifiable subject to which said given constituent data component pertains;
and
wherein said given data request is selectively fulfilled by the data storage
system as a function of said re-identification risk value.
28. A computer-readable medium having stored thereon instructions for
execution by
a computing device for fulfilling a data request for a context-specific
dataset
based on an existing raw data set, said computing device being in network
communication with a data storage system comprising a plurality of data
storage
components, each of said data storage components being in network
communication, and configured for distributed storage of a plurality of data
objects, each said data object comprising a plurality of constituent data
object
components, the instructions executable to automatically implement the steps
of
any one of the methods of claims 15 to 26.
29. A data storage system for fulfilling a data request for a context-
specific data set,
said context-specific data set based on a raw data set, the system comprising:
a plurality of network-accessible hardware storage resources, each of said
hardware storage resources being in network communication and configured for
distributed storage of data objects;
a digital data processor responding to data access requests received over a
network and relating to the data objects;
a key-value store comprising a unique key-value logical row for each
constituent data component of each of said data objects, each said unique key-
value logical row comprising:
a key for identifying said unique key-value logical row;

a constituent data component value comprising stored digital
information relating to said constituent data component associated with
said unique key-value logical row; and
a metadata descriptor describing metadata of said constituent data
component value;
wherein, in response to a given data request, said digital data processor:
generates a first context-specific data set based on existing key-
value logical rows;
associates a re-identification risk value with said first context-
specific data set representative of a likelihood that a given constituent data
component used to generate said first context-specific data set can be
directly associated with an identifiable subject to which said given
constituent data component pertains;
selectively fulfils said given data request based on said re-
identification risk value by:
providing access to said first context-specific data set upon
said re-identification risk satisfying a designated risk criterion;
otherwise
automatically generating and providing access to a derived
context-specific data set so to fulfil the data request, wherein the
derived context-specific data set is based on at least one derived
key-value logical row that is automatically derived from one or
more pre-existing direct or derived key-value logical rows
associated with the data request so to automatically increase a
similarity between corresponding aspects of constituent data
components derived from said one or more pre-existing key-value
logical rows and thus reduce a given re-identification risk value
associated with said derived context-specific data set given said
similarity increase.

Description

Note: Descriptions are shown in the official language in which they were submitted.


METHODS AND SYSTEMS FOR CONTEXT-SPECIFIC DATA SET DERIVATION
FROM UNSTRUCTURED DATA IN DATA STORAGE DEVICES
FIELD OF THE DISCLOSURE
[0001] The present disclosure relates to scalable, secure, and policy-compliant distributed data storage systems, and, in particular, to methods and systems for context-specific data set derivation from unstructured data in data storage devices.
BACKGROUND
[0002] Organizations are storing increasingly large amounts of diverse, unstructured data. This data is not typically in a format that can be easily analyzed and stored using traditional storage and analytics systems. Moreover, there is often sensitive information within the data and there is a need to ensure it is properly governed. This includes not only granular access control but also the ability to create and supply datasets that conform to existing privacy and compliance standards and follow strict retention policies on the data.
[0003] Two approaches that have been adopted across various industries include (i) the use of relational databases, and (ii) Hadoop file systems, used for distributed storage and for processing big data sets. Relational databases apply either row/column level security or stricter security at the application level, and therefore have difficulty in scaling and applying operational requirements to a more granular set of data, including when the granularization requirements may be dynamic or unknown at data upload. Hadoop file systems use heterogeneous data stores with an external governance framework that can be used to tag and govern the data. The level of control in Hadoop systems varies according to the capabilities of the underlying system. For instance, a document store can protect at the document level but not within the document (which may also be a problem for relational databases). This makes it difficult to apply a uniform and granular governance model to the data and to create datasets flexible enough to conform to modern privacy rules.
1082P-RRI-CAD! 1
CA 2986320 2017-11-21

[0004] Further, relational databases are unable to scale to the sizes required to store big data today (petabytes). They are often expensive to scale and scale vertically, whereas a key/value store is horizontally scalable. They do not provide flexible curation of the data since they have fixed schemas and cannot run any type of processing function across all the data stored. Databases are also limited to row/column level security and, although finer-grained security can be achieved at the application layer, this is a complex approach that requires a heavy-weight update to both the database and application whenever the security model changes.
[0005] A need exists for methods and systems that provide for context-specific data set derivation from unstructured data in data storage devices that overcome some of the drawbacks of known techniques, or at least, provide a useful alternative thereto. This background information is provided to reveal information believed by the applicant to be of possible relevance. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art or forms part of the general common knowledge in the relevant art.
SUMMARY
[0006] The following presents a simplified summary of the general inventive concept(s) described herein to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to restrict key or critical elements of the invention or to delineate the scope of the invention beyond that which is explicitly or implicitly described by the following description and claims.
[0007] The disclosure is a system and a method for granular governance and flexible curation of digital assets. Further, the inventive subject matter disclosed herein provides, in some embodiments, a flexible framework to govern and curate unstructured and structured data and create datasets that can be used for, as an example, operational use and analytics within the proper context; and further, in some embodiments, in a manner that can comply with changing legal and regulatory requirements. Accordingly, granular management of data sets that can scale with the rapid growth of data storage and processing requirements, with a customizable approach to policy compliance, is required.
[0008] In accordance with one aspect, there is provided a data storage system for
system for
fulfilling a data request for a context-specific data set, said context-
specific data set based
on a raw data set, the system comprising: a plurality of network-accessible
hardware
storage resources, each of said hardware storage resources being in network
communication and configured for distributed storage of data objects; a
digital data
processor responding to data access requests received over a network and
relating to the
data objects; a key-value store comprising a unique key-value logical row for
each
constituent data component of each of said data objects, each said unique key-
value
logical row comprising: a key for identifying said unique key-value logical
row; a
constituent data component value comprising stored digital information
relating to said
constituent data component associated with said unique key-value logical row;
and a
metadata descriptor describing metadata of said constituent data component
value;
wherein at least one key-value logical row for a given data object is a direct
key-value
logical row directly associated with the raw data set and wherein at least one
key-value
logical row for the given data object is a derived key-value logical row
derived from one
or more other key-value logical rows; wherein, upon said digital data
processor
generating the context-specific data set responsive to a given data request to
the data
storage system, said digital data processor further generates a re-
identification risk value
for the context-specific data set to be associated therewith, said re-
identification risk
value representative of a likelihood that a given constituent data component
used to
generate the context-specific data set can be directly associated with an
identifiable
subject to which said given constituent data component pertains; and wherein
said given
data request is selectively fulfilled by the data storage system as a function
of said re-
identification risk value.
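The key-value logical row described in the preceding paragraph can be pictured with a minimal Python sketch. It is illustrative only and not taken from the disclosure: the names LogicalRow, KeyValueStore, derived_from and by_metadata are hypothetical, and an in-memory dictionary stands in for the distributed, network-accessible storage resources.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LogicalRow:
    """One key-value logical row for a constituent data component (hypothetical layout)."""
    key: str                             # uniquely identifies this row
    value: str                           # constituent data component value
    metadata: str                        # metadata descriptor for the value
    derived_from: Tuple[str, ...] = ()   # keys of source rows; empty for a direct row

    @property
    def is_direct(self) -> bool:
        # A direct row comes straight from the raw data set.
        return not self.derived_from


class KeyValueStore:
    """In-memory stand-in for the distributed key-value store (assumption)."""

    def __init__(self) -> None:
        self._rows: Dict[str, LogicalRow] = {}

    def put(self, row: LogicalRow) -> None:
        if row.key in self._rows:
            raise KeyError(f"duplicate key: {row.key}")
        missing = [k for k in row.derived_from if k not in self._rows]
        if missing:
            raise KeyError(f"unknown source rows: {missing}")
        self._rows[row.key] = row

    def by_metadata(self, descriptor: str) -> List[LogicalRow]:
        """Return rows whose metadata descriptor matches, in insertion order."""
        return [r for r in self._rows.values() if r.metadata == descriptor]


store = KeyValueStore()
# Direct row: taken from the raw data set.
store.put(LogicalRow("obj1/dob", "1984-03-17", "date_of_birth"))
# Derived row: computed from another row of the same data object.
store.put(LogicalRow("obj1/birth_year", "1984", "birth_year",
                     derived_from=("obj1/dob",)))
```

In this sketch, a context-specific data set would be assembled by selecting rows through their metadata descriptors, for example store.by_metadata("birth_year").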
[0009] In accordance with some aspects, there are provided data storage
systems
wherein said re-identification risk value may optionally be generated based on
similarities
between an aspect of the given constituent data component, and a corresponding
aspect of
at least one other constituent data component used in the context-specific
data set.
[0010] In accordance with some aspects, there are provided data storage
systems
wherein the re-identification risk may be generated based on at least one of
the following
calculated properties of the aspect of the context-specific data set: k-anonymity, t-closeness, l-diversity, and privacy differential.
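As an illustration of one of the listed properties, the following sketch computes k-anonymity over a small data set and turns it into a risk score. The disclosure does not state how a property is mapped to a re-identification risk value; the 1/k mapping, the function names, and the sample records are assumptions made for this example only.

```python
from collections import Counter
from typing import Dict, List, Sequence


def k_anonymity(records: List[Dict[str, str]],
                quasi_identifiers: Sequence[str]) -> int:
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    if not records:
        raise ValueError("cannot score an empty data set")
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())


def reidentification_risk(records: List[Dict[str, str]],
                          quasi_identifiers: Sequence[str]) -> float:
    """Assumed mapping: risk = 1/k, so a record that is unique in the set scores 1.0."""
    return 1.0 / k_anonymity(records, quasi_identifiers)


# Hypothetical context-specific data set.
records = [
    {"birth_year": "1984", "region": "V6B"},
    {"birth_year": "1984", "region": "V6B"},
    {"birth_year": "1984", "region": "V5K"},
    {"birth_year": "1991", "region": "V5K"},
    {"birth_year": "1991", "region": "V5K"},
]

print(reidentification_risk(records, ["birth_year", "region"]))  # 1.0 (one record is unique)
print(reidentification_risk(records, ["birth_year"]))            # 0.5 (smallest group has 2)
```

Dropping the region column raises k from 1 to 2, which is the kind of similarity increase between components that lowers the risk value.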
[0011] In accordance with some aspects, there are provided data storage
systems
wherein each key-value logical row may further comprise a sensitivity value
indicating a
sensitivity associated with a corresponding key-value logical row.
[0012] In accordance with some aspects, there are provided data storage
systems
wherein the sensitivity value may be associated with one or more of the
following: a
permissible requesting user identifier for the corresponding key-value logical
row or an
aspect thereof, a predetermined sensitivity tag associated with one or more
aspects of the
corresponding key-value logical row, the raw data sets from which the
corresponding
key-value logical row originated, or the metadata descriptor of the
corresponding key-
value logical row.
[0013] In accordance with some aspects, there are provided data storage
systems
wherein said given data request for the context-specific data set so generated in response thereto may be selectively fulfilled solely upon said re-identification risk value associated with the context-specific data set being lower than a designated re-identification risk threshold.
[0014] In accordance with some aspects, there are provided data storage
systems
wherein said re-identification risk threshold may be automatically determined
by the data
storage system based on whether a requesting computing device is within a
designated
zone of trust.
[0015] In accordance with some aspects, there are provided data storage
systems
wherein said re-identification risk threshold may be automatically determined
by the data
storage system based on one or more of: an identity of a requesting user, a
role of the
requesting user, sensitivity of data components in the context-specific data
set, a location
of a requesting computing device, a security indication of the requesting
computing
device, or a combination thereof.
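The threshold selection and the selective fulfilment described above might look like the following sketch. The role names, the numeric thresholds, and the rule that halves the threshold outside the zone of trust are all hypothetical; the disclosure only states that the threshold may depend on such factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    """Facts about the requester that the threshold may depend on (hypothetical fields)."""
    user_role: str
    in_trusted_zone: bool


# Hypothetical per-role thresholds; a higher threshold tolerates more risk.
ROLE_THRESHOLDS = {"clinician": 0.5, "analyst": 0.2}
DEFAULT_THRESHOLD = 0.05


def risk_threshold(ctx: RequestContext) -> float:
    """Pick a threshold from the role, then tighten it outside the zone of trust."""
    base = ROLE_THRESHOLDS.get(ctx.user_role, DEFAULT_THRESHOLD)
    return base if ctx.in_trusted_zone else base / 2


def may_fulfil(risk: float, ctx: RequestContext) -> bool:
    """Fulfil only when the risk is strictly lower than the threshold."""
    return risk < risk_threshold(ctx)


inside = RequestContext(user_role="clinician", in_trusted_zone=True)
outside = RequestContext(user_role="clinician", in_trusted_zone=False)
print(may_fulfil(0.25, inside))   # True:  0.25 < 0.5
print(may_fulfil(0.25, outside))  # False: 0.25 is not lower than 0.25
```

The strict comparison follows the wording "lower than a designated re-identification risk threshold"; a request whose risk equals the threshold is refused in this sketch.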
[0016] In accordance with some aspects, there are provided data storage
systems
wherein at least one said derived key-value logical row may be automatically
generated
from the raw data set upon importing such raw data set into the data storage
system.
[0017] In accordance with some aspects, there are provided data storage
systems
wherein at least one said derived key-value logical row may be derived upon
request for
such derivation by a user of the system. In accordance with some aspects,
there are
provided data storage systems wherein at least one said derived key-value
logical row
may be automatically derived from one or more pre-existing direct or derived
key-value
logical rows that are associated with said given data request so to
automatically increase a
similarity between corresponding aspects of constituent data components
derived from
said one or more existing key-value logical rows and thus reduce a given re-
identification
risk value associated with a derived context-specific data set relying on said
at least one
derived key-value logical row given said similarity increase.
[0018] In accordance with some aspects, there are provided data storage
systems
wherein said derived key-value logical row may be derived by obfuscating the
constituent
data component value of said pre-existing key-value logical rows to generate
the
constituent data component value of the derived key-value logical row, and the
corresponding metadata descriptor of the derived key-value logical row being
generated
based on said obfuscating.
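One possible form of obfuscation is generalization, sketched below: a date of birth is replaced by a ten-year range, and the metadata descriptor of the derived row records how the value was obfuscated. The dictionary layout, the key suffix, and the descriptor format are assumptions for illustration; the disclosure does not prescribe a particular obfuscation technique here.

```python
from typing import Dict, Tuple, Union

Row = Dict[str, Union[str, Tuple[str, ...]]]


def generalize_birth_date(row: Row, bucket_years: int = 10) -> Row:
    """Derive a new row whose value is a year range instead of an exact date.

    Assumes the source value starts with a four-digit year (e.g. "1984-03-17").
    """
    year = int(str(row["value"])[:4])
    start = year - year % bucket_years
    return {
        "key": f'{row["key"]}#generalized',
        "value": f"{start}-{start + bucket_years - 1}",
        # The derived descriptor is generated from the obfuscation that was applied.
        "metadata": f'{row["metadata"]};generalized:{bucket_years}y',
        "derived_from": (str(row["key"]),),
    }


direct = {
    "key": "obj1/dob",
    "value": "1984-03-17",
    "metadata": "date_of_birth",
    "derived_from": (),
}
derived = generalize_birth_date(direct)
print(derived["value"])     # 1980-1989
print(derived["metadata"])  # date_of_birth;generalized:10y
```

Because many subjects share the same ten-year range, rows derived this way are more similar to one another than the exact dates they replace.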
[0019] In accordance with some aspects, there are provided data storage
systems
wherein said derived context-specific data set may be automatically generated
by the data
storage system upon said re-identification risk value associated with a first
context-
specific data set being too high to permit selective fulfilment of said given
data request.
In accordance with some aspects, there are provided data storage systems
wherein said
derived context-specific data set may be generated automatically upon said re-
identification risk value associated with a first context-specific data set
being higher than
a first designated threshold.
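The fallback from a first context-specific data set to a derived one can be sketched as follows. The 1/k risk measure, the decade generalization, and the choice to refuse the request (returning None) when even the derived data set is too risky are assumptions for this example, not details given in the disclosure.

```python
from collections import Counter
from typing import List, Optional


def risk(values: List[str]) -> float:
    """Assumed risk measure: 1/k, where k is the size of the smallest group of equal values."""
    if not values:
        raise ValueError("cannot score an empty data set")
    return 1.0 / min(Counter(values).values())


def to_decade(year: int) -> str:
    """Generalize an exact year to its decade, e.g. 1984 -> "1980s"."""
    return f"{year - year % 10}s"


def fulfil(birth_years: List[int], threshold: float) -> Optional[List[str]]:
    """Return the first data set if its risk is low enough, else a derived one, else None."""
    first = [str(y) for y in birth_years]
    if risk(first) < threshold:
        return first
    # Risk too high: derive generalized values to increase similarity between rows.
    derived = [to_decade(y) for y in birth_years]
    if risk(derived) < threshold:
        return derived
    return None  # assumption: refuse when no acceptable data set can be derived


print(fulfil([1981, 1984, 1987, 1992, 1995], threshold=0.6))
# ['1980s', '1980s', '1980s', '1990s', '1990s']
```

Every exact year in the sample is unique, so the first data set has a risk of 1.0; after generalization the smallest group has two members and the risk drops to 0.5, which is below the 0.6 threshold.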
[0020] In accordance with one aspect, there is provided a data storage
method for
fulfilling a data request for a context-specific dataset based on one or more raw
data sets, the
method implemented on a data storage system comprising a plurality of network-
accessible hardware storage resources, each of said hardware storage resources
being in
network communication and configured for distributed storage of data objects,
and a
digital data processor for responding to data storage requests received over a
network and
relating to said data objects, the method comprising: storing a key-value
store comprising
a unique key-value logical row for each constituent data component of each
data object,
each key-value logical row comprising: a key for identifying the key-value
logical row; a
constituent data component value comprising stored digital information
relating to the
constituent data component associated with the key-value logical row; and a
metadata
descriptor describing metadata of a data component value; directly generating
at least one
of the key-value logical rows for a given data object from raw data; deriving
at least one
of the key-value logical rows for the given data object from other key-value
logical rows;
generating the context-specific data set responsive to the data request;
generating a re-identification risk value for the context-specific data set,
the re-identification risk value
indicating a likelihood that a given constituent data component used to
generate the
context-specific data set can be directly associated with an identifiable
subject to which
said given constituent data component pertains; and selectively fulfilling the
context-
specific data request as a function of said re-identification risk value.
[0021] In accordance with some aspects, there are provided data storage
methods
wherein said re-identification risk value may be generated based on
similarities between
an aspect of the given constituent data component, and a corresponding aspect
of at least
one other constituent data component used in the context-specific data set.
[0022] In accordance with some aspects, there are provided data storage
methods
wherein said re-identification risk value may be generated based on at least
one of the
following calculated properties of the aspect of the context-specific data
set: k-anonymity, t-closeness, l-diversity, or differential privacy.
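By way of illustration only, the following sketch shows one way a re-identification risk value could be computed from k-anonymity. The function names, the record layout, and the choice of 1/k as the risk value are assumptions made for this example and are not prescribed by the present disclosure.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share the
    same values for every quasi-identifier."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def reidentification_risk(records, quasi_identifiers):
    """Assumed risk measure: 1/k, so a record that is unique on its
    quasi-identifiers yields the maximum risk of 1.0."""
    k = k_anonymity(records, quasi_identifiers)
    return 1.0 / k if k else 0.0


# Hypothetical context-specific data set.
dataset = [
    {"age_range": "30-39", "postal_prefix": "V6B", "disease": "asthma"},
    {"age_range": "30-39", "postal_prefix": "V6B", "disease": "diabetes"},
    {"age_range": "40-49", "postal_prefix": "V5K", "disease": "asthma"},
]
risk = reidentification_risk(dataset, ["age_range", "postal_prefix"])
```

In this example the third record is unique on its quasi-identifiers, so k is 1 and the assumed risk value is at its maximum.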
[0023] In accordance with some aspects, there are provided data storage
methods
wherein each key-value logical row may further comprise a sensitivity value
indicating a
sensitivity associated with the corresponding key-value logical row. In
accordance with
some aspects, there are provided data storage methods wherein the sensitivity
value may
be associated with one or more of the following: a permissible requesting user
identifier
for the corresponding key-value logical row or an aspect thereof, a
predetermined
sensitivity tag associated with one or more aspects of the corresponding key-
value logical
row, the raw data sets from which the corresponding key-value logical row
originated, or
the metadata descriptor of the corresponding key-value logical row.
[0024] In accordance with some aspects, there are provided data storage
methods
wherein said step of selectively fulfilling may comprise fulfilling the data
request solely
upon said re-identification risk value associated with the context-specific
data set being
lower than a designated risk threshold. In accordance with some aspects, there
are
provided data storage methods wherein said risk threshold may be determined
based on
whether a requesting computing device is within a designated zone of trust. In
accordance
with some aspects, there are provided data storage methods wherein said risk
threshold
may be determined based on one or more of: an identity of a requesting user, a
role of the
requesting user, sensitivity of data components in the context-specific data
set, a location
of a requesting computing device, a security indication of the requesting
computing
device, or a combination thereof.
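By way of illustration only, the following sketch shows how a risk threshold might be selected from the role of the requesting user and the zone of trust of the requesting device, and how a request might then be selectively fulfilled. The role names, threshold values, and the halving rule for untrusted zones are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    role: str
    in_trusted_zone: bool


# Hypothetical per-role thresholds; a real deployment would configure these.
ROLE_THRESHOLDS = {"clinician": 0.5, "researcher": 0.2}
DEFAULT_THRESHOLD = 0.05


def risk_threshold(ctx):
    """Pick a threshold from the requester's role and zone of trust."""
    base = ROLE_THRESHOLDS.get(ctx.role, DEFAULT_THRESHOLD)
    # Assumption: requests from outside the zone of trust are held to a
    # stricter (halved) threshold.
    return base if ctx.in_trusted_zone else base / 2


def selectively_fulfil(dataset, risk, ctx):
    """Return the data set only when its risk is lower than the threshold;
    otherwise return None to signal that the request is not fulfilled."""
    return dataset if risk < risk_threshold(ctx) else None
```

Under these assumptions, the same data set with the same risk value may be released to a requester inside the zone of trust and withheld from one outside it.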
[0025] In accordance with some aspects, there are provided data storage
methods
further comprising: automatically generating a derived context-specific data set
to fulfil the data request, wherein the derived context-specific data set is
based on at least
one derived key-value logical row that is automatically derived from one or
more pre-
existing direct or derived key-value logical rows associated with the data
request so as to
automatically increase a similarity between corresponding aspects of
constituent data
components derived from said one or more pre-existing key-value logical rows
and thus
reduce a given re-identification risk value associated with said derived
context-specific
data set given said similarity increase.
[0026] In accordance with some aspects, there are provided data storage
methods
wherein the at least one derived key-value logical row may be derived by
obfuscating the
constituent data component value of the corresponding one or more pre-existing
key-
value logical rows to generate the constituent data component value of the
derived key-
value logical row, and the corresponding metadata descriptor of the derived
key-value
logical row being generated based on said obfuscating.
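By way of illustration only, the following sketch derives an obfuscated logical row from a pre-existing one by generalizing an exact age into an age range, and generates the metadata descriptor of the derived row from the obfuscation that was applied. The `LogicalRow` fields, the descriptor naming scheme, and the use of random UUIDs as keys are assumptions made for this example.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalRow:
    key: str
    object_id: str
    descriptor: str
    value: object


def derive_age_range(row, width=10):
    """Derive a generalized row from an 'age' row by bucketing its value."""
    low = (int(row.value) // width) * width
    return LogicalRow(
        key=str(uuid.uuid4()),                           # new unique key
        object_id=row.object_id,                         # same data object
        descriptor=f"{row.descriptor}:range({width})",   # records the obfuscation
        value=f"{low}-{low + width - 1}",
    )


source = LogicalRow("k1", "patient-1", "age", 37)
derived = derive_age_range(source)
```

Because several patients fall into the same range, the derived rows are more similar to one another than the source rows, which is the property relied upon to lower the re-identification risk.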
[0027] In accordance with some aspects, there are provided data storage
methods
wherein the derived context-specific data set may be generated upon the re-
identification
risk associated with a first context-specific data set being too high to
permit selective
fulfilment of the data request. In accordance with some aspects, there are
provided data
storage methods wherein the derived context-specific data set is generated
automatically
upon the re-identification risk value associated with a first context-specific
data set being
higher than a first designated risk threshold.
[0028] In accordance with one aspect, there is provided a device for
fulfilling a data
request for a context-specific dataset based on an existing raw data set, the
device being
in network communication with a plurality of network-accessible hardware
storage
resources, each of said hardware storage resources being in network
communication, and
configured for distributed storage of data objects, the device comprising: a
digital data
processor for responding to data storage requests received over a network and
relating to
said data objects; and a network communications interface for communicatively
interfacing one or more requesting users and a key-value store configured to
store a
unique key-value logical row for each constituent data object component of
each data
object, each such key-value logical row comprising: a key for identifying the
key-value
logical row; a constituent data component value comprising stored digital
information
relating to the constituent data component associated with the key-value
logical row; and
a metadata descriptor describing metadata of the constituent data component
value;
wherein at least one key-value logical row for a given data object is a direct
key-value
logical row directly associated with raw data and wherein at least one key-
value logical
row for the given data object is a derived key-value logical row derived from
one or more
other key-value logical rows; and wherein, upon said digital data processor
generating the
context-specific data set responsive to a given data request to the data
storage system,
said digital data processor further generates a re-identification risk value
for the context-
specific data set to be associated therewith, said re-identification risk
value representative
of a likelihood that a given constituent data component used to generate the
context-
specific data set can be directly associated with an identifiable subject to
which said
given constituent data component pertains; and wherein said given data request
is
selectively fulfilled by the data storage system as a function of said re-
identification risk
value.
[0029] In accordance with one aspect, there is provided a computer-
readable medium
having stored thereon instructions for execution by a computing device for
fulfilling a
data request for a context-specific dataset based on an existing raw data set,
said
computing device being in network communication with a data storage system
comprising a plurality of data storage components, each of said data storage
components
being in network communication, and configured for distributed storage of a
plurality of
data objects, each said data object comprising a plurality of constituent
data object
components, the instructions executable to automatically implement the steps
of any one
of the methods disclosed herein.
In accordance with one aspect, there is provided a data storage system for
fulfilling a data
request for a context-specific data set, said context-specific data set based
on a raw data
set, the system comprising: a plurality of network-accessible hardware storage
resources,
each of said hardware storage resources being in network communication and
configured
for distributed storage of data objects; a digital data processor responding
to data access
requests received over a network and relating to the data objects; a key-value
store
comprising a unique key-value logical row for each constituent data component
of each
of said data objects, each said unique key-value logical row comprising: a key
for
identifying said unique key-value logical row; a constituent data component
value
comprising stored digital information relating to said constituent data
component
associated with said unique key-value logical row; and a metadata descriptor
describing
metadata of said constituent data component value; wherein, in response to a
given data
request, said digital data processor: generates a first context-specific data
set based on
existing key-value logical rows; associates a re-identification risk value
with said first
context-specific data set representative of a likelihood that a given
constituent data
component used to generate said first context-specific data set can be
directly associated
with an identifiable subject to which said given constituent data component
pertains;
selectively fulfils said given data request based on said re-identification
risk value by:
providing access to said first context-specific data set upon said re-
identification risk
satisfying a designated risk criterion; otherwise automatically generating and
providing
access to a derived context-specific data set so as to fulfil the data request,
wherein the
derived context-specific data set is based on at least one derived key-value
logical row
that is automatically derived from one or more pre-existing direct or derived
key-value
logical rows associated with the data request so as to automatically increase a
similarity
between corresponding aspects of constituent data components derived from said
one or
more pre-existing key-value logical rows and thus reduce a given re-
identification risk
value associated with said derived context-specific data set given said
similarity increase.
[0030] The system can receive unstructured or structured data as an
input. In some
cases, the input data could be acquired from a patient record, a financial
record or other
type of record and can come in several formats such as PDF, CSV or other types
of
electronic or non-electronic inputs. The input data will go through an initial
process,
sometimes referred to as an ingestion process, or alternatively referred to as
a data input
process, which consists of obtaining data from its original form in the raw
data and
storing it in a logical row in a key-value store. This original form of the
data may be
referred to as "raw data" and can include text files, PDF, CSV, spreadsheets,
etc. Upon or
even during ingestion or data input, the system automatically associates
metadata
information with the data stored in key/value pairs within a logical row in
the key value
store. This metadata may provide information that describes the data or a
characteristic thereof, including contextual, descriptive and governance
information such
as the origin, ownership, integrity information (such integrity information,
in turn,
including but not limited to size, encoding, checksum information, and
retention
requirements information) and other governance-related information.
Embodiments of the
subject matter disclosed herein may be employed by a user to add additional
logical rows,
and thus additional metadata information regarding data, a data object, or one
or more
other logical rows at any time after data input to further describe context or
attributes of
the data (or data object).
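By way of illustration only, the following sketch ingests a small CSV document into logical rows, one per discrete value, and associates governance metadata (origin and a checksum of the raw data) with each row at ingestion time. The dictionary layout, field names, and file name are hypothetical; an actual embodiment would write to a key-value store rather than to an in-memory list.

```python
import csv
import hashlib
import io
import uuid


def ingest_csv(raw_text, object_id_field, origin):
    """Turn every cell of a CSV document into a logical row (a dict here)."""
    checksum = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        object_id = record[object_id_field]
        for field, value in record.items():
            rows.append({
                "key": str(uuid.uuid4()),      # unique key for the row
                "object_id": object_id,        # data object the row belongs to
                "descriptor": field,           # metadata descriptor
                "value": value,                # constituent data component value
                "origin": origin,              # governance metadata
                "source_checksum": checksum,   # integrity metadata
            })
    return rows


raw = "patient_id,name,age\np1,Ann,37\np2,Bob,52\n"
rows = ingest_csv(raw, "patient_id", origin="clinic-export.csv")
```

Additional logical rows can later be appended to the same list for the same `object_id` to further describe the data object.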
[0031] In some embodiments, there may also be provided a framework for
executing
distributed processing functions to curate the raw data into different forms.
Once
ingested, the collection of all logical rows in the key value store makes up
all the data
relating to a given data object, and may be referred to as a digital asset.
This curation can
occur at the time of ingest or after ingest. In some embodiments, curation may
consist of
the following functions, inter alia: (1) The extraction or computation of
derived data
from the original data; and (2) The addition of context to the data. These
functions can be
considered to be generating additional metadata information associated with
the original
raw data, data object, or data relating to the data object, and which is
stored with a value
relating to said additional metadata information alongside the raw data in key
value pairs.
In this way, additional information, descriptors, context, and governance
information can
be associated with a data object and/or data asset, either at the time of
ingestion or later.
Datasets relating to a set of data objects can be generated based on the
existing metadata
information in the applicable logical rows, which can then be presented to
different users
at different times depending on context; no access to the original data is
required and the
nature and level of access may be governable in a highly customized, dynamic,
and
granular fashion.
[0032] Embodiments may also provide a distributed execution framework,
which
simplifies the process of writing distributed jobs to curate data thereby
enabling
developers who are not familiar with a distributed system to write data
processing
functions that extract, generate and store additional derived data.
[0033] Embodiments may also provide the ability to use metadata to
generate on-
demand context-specific datasets consisting of metadata and/or raw data. These
can be
exported and protected with privacy or other access and compliance rules.
[0034] Other aspects, features and/or advantages will become more apparent
upon
reading of the following non-restrictive description of specific embodiments
thereof,
given by way of example only with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE FIGURES
[0035] Several embodiments of the present disclosure will be provided, by
way of
examples only, with reference to the appended drawings, wherein:
[0036] Figure 1 shows an exemplary architecture in accordance with one
embodiment
of the instant disclosure.
[0037] Figure 2 shows a schematic of a system in accordance with another
aspect of
the instant disclosure.
[0038] Figure 3 shows an exemplary schema of a key-value store in
accordance with
an aspect of the instant disclosure.
[0039] Figure 4 shows a conceptual schema and workflow in
accordance with an aspect of the instant disclosure.
[0040] Figure 5 shows another conceptual schema and workflow in
accordance with
an aspect of the instant disclosure.
[0041] Figure 6 shows a conceptual workflow for deriving datasets in
accordance
with an aspect of the instant disclosure.
[0042] Elements in the several figures are illustrated for simplicity and
clarity and
have not necessarily been drawn to scale. For example, the dimensions of some
of the
elements in the figures may be emphasized relative to other elements for
facilitating
understanding of the various presently disclosed embodiments. Also, common,
but well-
understood elements that are useful or necessary in commercially feasible
embodiments
are often not depicted in order to facilitate a less obstructed view of these
various
embodiments of the present disclosure.

DETAILED DESCRIPTION
[0043] Various implementations and aspects of the specification will be
described
with reference to details discussed below. The following description and
drawings are
illustrative of the specification and are not to be construed as limiting the
specification.
Numerous specific details are described to provide a thorough understanding of
various
implementations of the present specification. However, in certain instances,
well-known
or conventional details are not described in order to provide a concise
discussion of
implementations of the present specification.
[0044] Various apparatuses and processes will be described below to
provide
examples of implementations of the system disclosed herein. No implementation
described below limits any claimed implementation and any claimed
implementations
may cover processes or apparatuses that differ from those described below. The
claimed
implementations are not limited to apparatuses or processes having all of the
features of
any one apparatus or process described below or to features common to multiple
or all of
the apparatuses or processes described below. It is possible that an apparatus
or process
described below is not an implementation of any claimed subject matter.
[0045] Furthermore, numerous specific details are set forth in order to
provide a
thorough understanding of the implementations described herein. However, it
will be
understood by those skilled in the relevant arts that the implementations
described herein
may be practiced without these specific details. In other instances, well-
known methods,
procedures and components have not been described in detail so as not to
obscure the
implementations described herein.
[0046] In this specification, elements may be described as "configured to"
perform
one or more functions or "configured for" such functions. In general, an
element that is
configured to perform or configured for performing a function is enabled to
perform the
function, or is suitable for performing the function, or is adapted to perform
the function,
or is operable to perform the function, or is otherwise capable of performing
the function.
[0047] It is understood that for the purpose of this specification,
language of "at least
one of X, Y, and Z" and "one or more of X, Y and Z" may be construed as X
only, Y
only, Z only, or any combination of two or more items X, Y, and Z (e.g., XYZ,
XY, YZ,
ZZ, and the like). Similar logic may be applied for two or more items in any
occurrence
of "at least one ..." and "one or more..." language.
[0048] The systems and methods described herein provide, in accordance
with
different embodiments, different examples that provide the ability to
create
different views of data, using data sets derived from the actual or live data,
depending on
the contextual requirements relating to the data and/or data consumer. Such
requirements
often include privacy compliance but may also include other administrative,
analytics,
management, or use-related requirements.
[0049] In some embodiments, there are provided methods and systems
relating to
relating to
data storage that leverage the capability of a key-value store. Some
embodiments may
utilize one or more storage devices, each of which may further comprise
storage sub-
elements (for example, a server comprising a plurality of storage blades that
each in turn
comprise multiple storage elements of the same or different types, e.g. flash
or disk).
Very large data sets may be distributed amongst many different local or remote
storage
elements; they may be closely stored (e.g. on the same device or on directly
connected
devices, such as different blades on the same server) or they may be highly
disparately
and remotely stored (e.g. on different, but networked, server clusters).
Furthermore, the
data stored may be duplicated for a number of reasons, including redundancy
and failure
handling, as well as efficiency (e.g. to store a copy of information that has
been recently
used "close" to other required data). Systems and methodologies for managing
such large
and complex data sets have been developed (e.g. HDFS for Hadoop™). Overlaying
such
complexity, however, there are disclosed methodologies, devices, and systems
for
storing, accessing, and using very large data sets using a key-value store to
ingest and
store data from raw data sources (e.g. a patient or financial record) in a
highly granular
fashion.
[0050] For a given data object, such as, for example, a patient record or
indeed a
patient, at least some if not all of the available data are ingested as
individual discrete
portions of data, along with a metadata descriptor of each portion; a key is
associated
with the entry for, in part, future identification. Accordingly, the key-value
store
comprises logical rows, wherein each logical row comprises an individual
portion of the
raw data or constituent data component value (the "value"), an identifier (the
"key"), a
metadata descriptor, a data object identifier, and, optionally in different
embodiments,
additional management information, such as authorization, sensitivity or other
compliance information and/or timestamp information. Each logical row of the
key-value store, comprising a constituent data component value and a key
identifier, may also be referred to as a key-value pair. The
collection of all logical rows for a given data object comprises the digital
asset (typically,
the data asset will also include the raw data, however, in many embodiments,
there will
be a logical row associated with the raw data; e.g. a patient record in a
text file or PDF
format). The concept of a data object may, in some embodiments, be considered
to be
broader than the data asset, and refer to all information, whether existing or
potential,
regarding any entity, such as a patient, hospital, doctor, bank, transaction,
etc. In one
exemplary embodiment, considering an existing patient as a data object and a
patient
record as the raw data, a first logical row may consist of an object id
relating to the
patient, a unique identifier (the key), a metadata descriptor of "raw data",
and a value
being the patient record data file itself; from the raw data file, additional
logical rows are
created for every discrete portion of raw data. Additional logical rows can
then be derived
from the existing logical rows as well as other applicable information; for
example, derived
logical rows corresponding to existing logical rows can be generated that
aggregate or
obfuscate existing logical rows. When combined with specific other logical
rows, any of
the existing logical rows, either imported (i.e. ingested) or derived (i.e.
curated), can be
provided along with, or excluded from, access requests associated with the
derived
logical row. Because there are very few limits on how such derived logical
rows can be
generated, and all of the data of the data asset are highly granularized to
individual
discrete pieces of data, provision and use of data associated with any given
data object (or
class or group of data objects) can be managed at the level of each such piece
of data.
That is, far below the level of the data object or table level as would be the
limitation in
state of the art systems. In some embodiments, the value-portion of a given
logical row
may be the actual value (or data), or it may be a reference, direct or
indirect, to the value
and/or storage location of the value.
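By way of illustration only, the following sketch shows a logical row whose value-portion is either the value itself or a reference to where the value is stored, together with a helper that resolves either form. The field names, the `blob://` reference scheme, and the in-memory `blob_store` mapping are hypothetical stand-ins for an actual storage back end.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalRow:
    key: str
    object_id: str
    descriptor: str
    value: Optional[str] = None       # the value itself, when stored inline
    value_ref: Optional[str] = None   # otherwise, a reference to its location


def resolve(row, blob_store):
    """Return the row's value, following the reference when it is not inline."""
    if row.value is not None:
        return row.value
    return blob_store[row.value_ref]


blob_store = {"blob://records/p1.pdf": "<contents of the patient record>"}
rows = [
    LogicalRow("k1", "p1", "raw data", value_ref="blob://records/p1.pdf"),
    LogicalRow("k2", "p1", "name", value="Ann"),
]
values = [resolve(r, blob_store) for r in rows]
```

Storing large raw files by reference keeps the logical rows small while still allowing access to be managed row by row.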
[0051] In embodiments, a key-value store may be employed for granular
governance
and flexible curation of digital assets. Embodiments hereof can receive
unstructured or
structured data as an input. In some cases, the input data could be acquired
from a patient
record, a financial record or other type of record and can come in several
formats such as
PDF, CSV or other types of electronic or non-electronic inputs. In accordance
with one
aspect, a key-value store is a data storage structure designed for storing,
retrieving, and
managing associative arrays, which contain a collection of objects or
records, which in
turn have different fields within them, each containing data. In some
embodiments, the
data included in a data collection will have related attributes so that the
data can be
stored, retrieved and managed in an efficient manner and this data collection
can be
derived, generated or calculated after or during curation. These records are
stored and
retrieved using a key identifier that uniquely identifies the record, and is
used to quickly
find data within a database. In addition to storing, retrieving, and managing
associative
arrays using the key identifier, disclosed implementations of the key-value
store allow
generation of context-specific datasets that are generated from the key-value
store itself
(keeping in mind that in some embodiments the "value" portion of a logical row
can be
the associated piece of data, or a reference thereto). Such generated datasets
may be
based on further utilization of additional descriptors and indicators,
depending on the data
access request.
[0052] In some embodiments, raw data, which may comprise any type of raw data in
various
formats, including PDF files, text files, CSV, database information, and
spreadsheet
documents, is extracted and stored as a data object comprising a key-value
logical row,
which comprises at least constituent data component values and the associated
metadata
descriptors. The data object is associated with the raw data, as well as all
other logical
rows that have been or may be created. Multiple and separate records relating
to a data
object, e.g. a patient, may constitute an example where a data object may be
associated
with more than one raw data set. In some embodiments, at run-time and/or
subsequent to
the ingestion or receipt of the raw data, metadata of the raw data are
collected, derived, or
formulated and are stored as key-value logical rows, with its unique key,
constituent data
component values and associated metadata descriptor. In embodiments, the
metadata
associated with a given logical row is a type of data that describes and gives
information
about the data to which the logical row pertains. For example, the metadata
could be "raw
data", "file type", "patient ID", "name", with the value associated therewith,
as extracted
from the raw data or derived from other data, stored in the same logical
row. Each
collected, derived, or formulated key-value entry is stored in the key-value
data store as a
key-value logical row, the rows collectively forming a data asset or a portion
thereof.
Examples of metadata include the name of the file, the type of the file, the
time the file
was stored, the raw data itself, and the information regarding who stored the
file. The
collected information gets parsed and saved in the key-value store as a key-
value logical
row with its respective key for unique identification, constituent data
component value,
and metadata descriptors. Concurrent to the collection of the information, at
the run time
or at subsequent times when the raw data exists in the key-value store, the
raw data may
be parsed for acquisition of metadata. The acquired metadata are stored in the
key-value
store with respective key for unique identification, constituent data
component value, and
metadata descriptors. The metadata preliminarily derived are saved as key-
value logical
rows in the key-value store, key-value logical rows collectively forming a
data object
associated with a raw data. First name, last name, type of disease, date of
financial
transaction and age are some examples of the acquired data. Furthermore,
derived
metadata may be derived from other logical rows, including either raw data,
acquired data
from the raw data, or other derived data; in some embodiments, it may be
derived from
other information associated with a data object, rather than directly from the
existing data
asset. The metadata associated with derived logical rows are stored in the key-
value store
as part of the key-value logical rows with the logical row unique identifier
(such unique
identifier being a unique key), a data object identifier, and constituent data
component
value. In some embodiments, metadata may be employed to formulate and output
a context- and requestor-specific dataset. For example, a data set may be
generated from
a key-value
store by accessing only obfuscated logical rows, as well as other logical rows
of lower sensitivity (or meeting other access criteria); accordingly, a derived
data set that is separate from the raw data, or
even the key-value store data is specifically produced for a certain context,
and that
context may be determined or created by generating specific types of logical
rows based
on pre-determined metadata. Another example may include a patient dataset
where a
derived logical row includes an age range, or first three digits of a postal
code, and the
resulting derived dataset is generated by accessing all non-identifying
information
regarding disease types and outcomes for a group of patients along with the
aforementioned derived logical row; without providing access to the raw data,
an analysis
of the dataset can be performed wherein disease frequency by age or location
can be
assessed without giving any direct access to sensitive information. As the
logical rows
can be generated before ingestion for automatic curation or after for more
customized
curation, dataset creation can be dynamic and compliant irrespective of the
type of
information stored regarding data objects.
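By way of illustration only, the patient example above can be sketched as follows: a dataset is assembled solely from logical rows whose metadata descriptors are permitted for the requesting context, and disease frequency by age range is then computed without touching identifying rows. The row tuples, descriptor names, and sample values are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical logical rows as (object_id, descriptor, value) tuples.
rows = [
    ("p1", "name", "Ann"), ("p1", "age_range", "30-39"), ("p1", "disease", "asthma"),
    ("p2", "name", "Bob"), ("p2", "age_range", "30-39"), ("p2", "disease", "asthma"),
    ("p3", "name", "Cy"), ("p3", "age_range", "50-59"), ("p3", "disease", "diabetes"),
]


def build_dataset(rows, allowed_descriptors):
    """Assemble one record per data object from the permitted rows only."""
    by_object = defaultdict(dict)
    for object_id, descriptor, value in rows:
        if descriptor in allowed_descriptors:
            by_object[object_id][descriptor] = value
    return list(by_object.values())


# Only the derived age range and the non-identifying disease rows are used.
dataset = build_dataset(rows, {"age_range", "disease"})
frequency = Counter((r["age_range"], r["disease"]) for r in dataset)
```

The resulting records carry neither the patient name nor the object identifier, so the analysis proceeds on the derived dataset alone.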
[0053] In some embodiments, the use of a key-value store paradigm, such as
Apache
Apache
Accumulo, may be used to provide granular access control to the data. The use
of a key-
value store, such as Accumulo, provides cell-level security with a visibility
field in the
key. The key-value store paradigm is a data model that stores raw
data in a key-
value pair and metadata values in the same logical row as additional key-value
pairs. The
column visibility field is used to store data attributes related to governance
or compliance
rules specified by the user.
[0054] In some embodiments, the constituent data component value may
comprise
comprise
stored digital information directly, or point to a location in storage where
the digital
information is stored. In some embodiments, the metadata descriptors may be
formed in
response to a data access request. In some embodiments, the data access request
would
comprise pre-determined metadata descriptors and new metadata descriptors
specified either by
a system administrator or an end-user (i.e. a request for a specific use and/or
context). In some
embodiments, the pre-determined metadata descriptors are the result of
processing the
raw data; these functions are sometimes referred to as data processing
functions (DPF).
Each data processing function is associated with a specific timestamp or
version for all
of the components that result from the processing. This associated timestamp
is included
in the key-value store and is similar to a version control feature. In some
embodiments,
this version control feature can allow for version roll back to a previous
processed state
and/or specific application of rules or data management of a processed
dataset. Such
timestamps can provide a mechanism to assess how a dataset changed over time
as the
state of the dataset can be assessed as it was at any point in time.
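By way of illustration only, the following sketch shows how timestamps stored with each entry could be used to reconstruct the state of a dataset as it was at a given point in time. The tuple layout, key naming, and integer timestamps are assumptions made for this example.

```python
def state_as_of(versions, as_of):
    """Rebuild {key: value} as it stood at time `as_of`.

    `versions` is a list of (timestamp, key, value) tuples; for each key the
    most recent value written at or before `as_of` wins.
    """
    state = {}
    for timestamp, key, value in sorted(versions):
        if timestamp <= as_of:
            state[key] = value
    return state


# Hypothetical history: a second processing run at t=200 widened the age
# range and added a postal code prefix.
versions = [
    (100, "p1/age_range", "30-39"),
    (200, "p1/age_range", "30-49"),
    (200, "p1/postal_prefix", "V6B"),
]
before = state_as_of(versions, 150)
after = state_as_of(versions, 200)
```

Comparing `before` and `after` shows how the dataset changed between the two processing runs, and selecting an earlier `as_of` value corresponds to rolling back to a previous processed state.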
[0055] In some embodiments, the data can be accessed directly through an
application programming interface (API), which can be a set of routines,
protocols and,
tools for building software applications. These direct access requests may
occur through a
library call for programmatic access in data science or a call through a
representational
state transfer (REST) API when accessing the data for an application. A query
using these
examples of direct data access may trigger a distributed routine to collect
the data across
various nodes. In another embodiment, the data may be accessed through
manufactured
datasets, using the distributed compute capability of software tools, such as
Accumulo
and/or Spark, on the cluster to create batch jobs that use metadata
descriptors to assemble
the necessary dataset and to generate said dataset into the format requested.
In some
embodiments, this dataset may be exported to a specified location to meet
governance,
privacy and/or compliance requirements.
[0056] The process of authorization regarding data access requests may be
simplified for administrators by using tags, attributes, and expressions, which
provide administrators with the ability to specify tags, attributes, or
expressions on the data at a high level. For example, the Accumulo software
provides users with a visibility field that allows the use of arbitrary
attributes such as PHI, PUBLIC, and DE-IDENTIFIED, which can then be assigned
to users/groups for authorization. In addition, Active Directory (AD) groups
may be used to link users/groups to authorizations. In one exemplary
embodiment, a customer may define a rule for a group called "researchers" in a
specified AD location, such as "researcher authorization allows you to see
data with attributes PUBLIC and DE-IDENTIFIED". The Accumulo infrastructure
allows user attributes identified for users/groups to be defined and used in
the same way; this attribute-based access control would authorize
users/groups/AD with particular attributes to access data with particular
attributes. In addition, there is a priority
order of evaluation for rules in the case where the administrator specifies
several rules
that overlap.
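By way of illustration only, the attribute-based access control described above can be sketched as a filter that compares the visibility attribute on each row with the authorizations held by a group. This is a simplified sketch under stated assumptions: each row carries a single attribute, whereas Accumulo visibility fields can hold boolean expressions over several attributes, and the rule priority order is not modelled. The "clinicians" group, the function `visible_rows`, and the sample rows are hypothetical.

```python
# Hypothetical mapping from groups (e.g. AD groups) to authorizations.
GROUP_AUTHORIZATIONS = {
    "researchers": {"PUBLIC", "DE-IDENTIFIED"},
    "clinicians": {"PUBLIC", "DE-IDENTIFIED", "PHI"},
}


def visible_rows(rows, group):
    """Return only the rows whose visibility attribute the group may see."""
    authorizations = GROUP_AUTHORIZATIONS.get(group, set())
    return [row for row in rows if row["visibility"] in authorizations]


# Sample rows; the values are illustrative only.
rows = [
    {"descriptor": "patient_name", "value": "John Smith", "visibility": "PHI"},
    {"descriptor": "age_group", "value": "30-39", "visibility": "DE-IDENTIFIED"},
    {"descriptor": "file_type", "value": "pdf", "visibility": "PUBLIC"},
]

researcher_view = visible_rows(rows, "researchers")
```

In this sketch a group with no configured authorizations sees nothing, so access is denied by default.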
[0057] In accordance with one aspect, the employment of a key-value store
permits the storage of, and operation on, at least four types of data,
collected or derived, when raw data is received or exists in the key-value
store: metadata descriptive of the raw data (e.g. the raw data file itself,
file name, file type, file size, etc.); metadata derived from the raw data
(e.g. patient name data from the corresponding patient name field within the
raw data file); metadata derived from the preliminarily derived metadata (e.g.
a pre-determined category, such as age group, where the value for such derived
logical row is determined from another existing logical row where the metadata
descriptor is age); and governance metadata (e.g. retention policies,
authorization, owner, etc.). In some examples, the metadata derived from the
raw data may be referred to as the tokenization of the original data; this
refers to any operation on data associated with a data object, including other
logical data rows, in order to protect, analyze, or generate new data from the
existing raw data or generated data at a granular level. This tokenization can
include obfuscation, aggregation, computation, and the application of filters.
Employing the metadata, the key-value store therefore allows formulation of
datasets and access thereto based on context- and requestor-specific
characteristics.
[0058] Each key-value logical row is assigned a unique key for
identification. In some embodiments, all key-value logical rows associated
with a given set of raw data may be assigned a unique key for identification.
In some embodiments, all key-value logical rows associated with a data object
may be assigned a unique key for identification. In other words, in some
embodiments, when an example of the disclosed system stores raw data, it may
assign a unique key identifier, grouping the metadata associated with the raw
data as a single logical entity, or grouping the metadata associated with a
data object associated with at least one raw data set as a single logical
entity. Each collected or derived datum with its unique key, associated
metadata descriptors, and corresponding constituent data component value is
stored as a key-value logical row in the key-value store. In some embodiments,
examples of the metadata descriptors for each collected or derived datum
include an accessibility authorization and/or sensitivity descriptor,
time-sequenced information, and temporal-/locality-based associations.
[0059] Since key values can be used for, among other purposes, identifying,
locating, and securing access to data objects, and since data can be indexed
and accessed based on the existence of certain metadata: (1) data can be
quickly located and accessed based on the existence of specified metadata
within the key-value store; (2) derived data sets can be generated directly
from the key-value store; and (3) regulatory and administrative compliance can
be enforced at a data storage layer (as opposed to at an application layer).
[0060] In various embodiments of the system, a key-value store is employed
for granular governance and flexible curation of digital assets.
[0061] In an exemplary embodiment, there is provided a data storage system
for generating context-specific datasets based on existing raw data sets. The
data storage system comprises a plurality of data storage components and a
processing component.
[0062] The plurality of data storage components is in network communication
and is configured for distributed storage of a plurality of data objects,
wherein each said data object comprises a plurality of constituent data object
components. An example of the plurality of data objects includes a set of data
related to or derived from either unstructured or structured data received by
the system as an input. A constituent data object component includes each set
of data that forms a part of the data object and may be generated
automatically, derived under system command, or formulated based on unique
requests.
[0063] The processing component has a data object key value store
accessible thereto,
wherein the data object key value store stores a unique key-value logical row
for each
constituent data object component. In other words, each constituent data
object
component is stored in the data object key value store, as a unique key-value
logical row.
[0064] Furthermore, each key-value logical row comprises: a key for
uniquely
identifying the key-value logical row; a constituent data object component
value for
providing component information relating to the constituent data object
component
associated with the key-value logical row; and a metadata descriptor for
describing a data object component characteristic of the constituent data
object component value. An example of a key for uniquely identifying the
key-value logical row includes a unique identifier for all the data generated,
derived, or formulated from an input received. An example of the constituent
data object component value may entail actual values for a given constituent
data object component; examples of a metadata descriptor include name and
age.
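By way of illustration only, a key-value logical row having a key, a constituent data object component value, and a metadata descriptor can be sketched as a small record type held in an in-memory store. This is a minimal sketch, not the disclosed storage layer. It assumes that the key identifies the input from which the rows originate and that a row is addressed by the pair of key and metadata descriptor; the names `LogicalRow` and `KeyValueStore` and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LogicalRow:
    key: str         # identifier shared by all data derived from one input
    descriptor: str  # metadata descriptor, e.g. "first_name" or "age"
    value: str       # constituent data object component value


class KeyValueStore:
    """In-memory stand-in for the data object key-value store."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], LogicalRow] = {}

    def put(self, row: LogicalRow) -> None:
        # One logical row per (key, descriptor) pair.
        self._rows[(row.key, row.descriptor)] = row

    def rows_for(self, key: str) -> List[LogicalRow]:
        """Return every logical row grouped under the given key."""
        return [r for r in self._rows.values() if r.key == key]


store = KeyValueStore()
store.put(LogicalRow("doc-1", "file_name", "intake_form.pdf"))
store.put(LogicalRow("doc-1", "first_name", "John"))
store.put(LogicalRow("doc-1", "age", "34"))
store.put(LogicalRow("doc-2", "file_name", "lab_report.pdf"))

doc1_descriptors = [r.descriptor for r in store.rows_for("doc-1")]
```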
[0065] The system may derive at least one of the constituent data object
components.
The system may further employ at least one of the constituent data object
component
values and derive at least one constituent data object component. In other
words, the
system may preliminarily derive constituent data object components. Then,
using the
values of the preliminarily derived constituent data object components, the
system may
further derive other constituent data object components. This operation may be
performed
by the system upon requests to the processing component, wherein the request
triggers
access to constituent data object component values comprising metadata
descriptors.
[0066] In some embodiments, each key-value logical row embeds additional
management information, such as an access authorization value for restricting
access to
the constituent data object component values, in response to requests
associated with a
corresponding authorization. This access authorization value can also be a
sensitivity tag
or other compliance and/or governance information and/or timestamp
information. The
access authorization value or sensitivity tag can correspond with a user
identity, user role
and/or a user group, restricting access to the constituent data object
component values.
Some examples of constituent data objects to which access may be restricted
include patient records, financial data, or proprietary, confidential, or
sensitive data. Some examples of user roles, user identities, or user groups
may include doctors, researchers, banks, and underwriters. In some
embodiments, the restriction of the constituent data object component values
will be based on governance and/or compliance rules such as data retention,
storage requirements, and data ownership. In other embodiments, rules
associated with timestamp information or version control information can be
used to restrict access to the constituent data objects. Some examples of
using timestamp
information may include restricting access to the most recent version of
constituent data
objects or limiting access to older versions of constituent data objects.
[0067] In another exemplary embodiment, at least one of the constituent data
object components for a given key-value logical row is derived from the input
raw data automatically upon storing the raw data associated with the data
object in the data storage components. In one embodiment, the derived data
sets may be associated with a set of pre-determined rules, or data processing
functions (DPF), which can be used to produce metadata descriptors for the raw
data or to add timestamp information or version control. The derivation may
take place under pre-determined requests, under data access requests, or by a
system administrator, either at run time or at subsequent times.
[0068] In another embodiment, these rules can be created during ingestion of
the data or after the data has been ingested. In some embodiments, these data
processing functions (DPF) are developed using a general purpose programming
framework, such as Spark and/or MapReduce, which enables curation functions to
be run across the constituent data objects.
[0069] In accordance with one aspect, there is disclosed a data storage
system for
generating a context-specific data set based on a raw data set. A raw data set
may include
different formats of documents that may be provided to the data storage
system. The
context-specific data set is generated based on the raw data set, in
accordance with
specific requisitions made of the data storage system.
[0070] In accordance with one aspect, the data storage system comprises a
plurality
of network-accessible hardware storage resources, a digital data processor,
and a key-
value store. The plurality of network-accessible hardware storage resources is
in network
communication and configured for distributed storage of data objects. The
data objects may include any type of data obtained, derived, or formulated
from, or related to, the raw data, including the raw data itself, upon the
receipt of the raw data by the data storage system. The digital
data processor responds to data access requests received over a network,
relating to the
data objects. Said data access requests may come from end-users regarding the
data
objects stored in the data storage system. The key-value store is stored in
said hardware
storage and composed of a unique key-value logical row for each constituent
data component of each of the data objects in the data storage system. In
accordance
with one
aspect, a data storage system may contain a number of data objects, which may
be
composed of constituent data components, related to a raw data set. In some
embodiments, a set of data objects or a data object may be related to a raw
data set
provided to the data storage system. The data object may be composed of
constituent data
components that were received, derived, or formulated at the time of, or
subsequent to the
receipt of the raw data at the data storage system. These constituent data
components may
include various characteristics and information regarding the raw data itself,
the data
derived from the raw data, and the data formulated from the data regarding the
raw data
or derived from the raw data under given requisitions.
[0071] Each said unique key-value logical row is composed of a key for
uniquely
identifying said unique key-value logical row, a constituent data component
value, and a
metadata descriptor. In some embodiments, the key for unique identification of
said
unique key-value logical row may be a value comprising stored digital
information. In
some embodiments, the key may be formulated from said constituent data
component
associated with said key-value logical row and a metadata descriptor. In some
embodiments, the key may be a combination or combinations of constituent data
component values and metadata descriptors. The constituent data component
values
comprise stored digital information relating to said constituent data
component associated
with said unique key-value logical row. This digital information may be a
value directly obtained, derived, or formulated from the raw data received. In
some embodiments, the digital information may store a value indicative of the
location where the actual value is stored. Examples of the digital information
include an actual first name, such as John, and a pointer value to a
designated location in a data storage. The metadata descriptor describes
metadata of said constituent data component value. Metadata generally
comprises data that provides information about other data. In some
embodiments,
including elements
such as title, abstract, author, and keywords. In accordance with one aspect,
metadata
describes containers of data and indicates how compound objects are put
together,
examples of which include types, versions, relationships and other
characteristics of
digital materials. In some embodiments, metadata provides information to help
manage a
resource, such as when and how it was created, file type and other technical
information, and who can access it.
[0072] In accordance with one aspect, at least one key-value logical row for
a given data object is directly associated with the raw data set and at least
one key-value logical row for the given data object is derived from one or
more other key-value logical rows. Examples of directly associated key-value
logical rows include the data obtained at the time of the receipt of the raw
data, such as file name and file type, and the data derived at the run time or
at subsequent times, such as first name and last name. In some embodiments,
said key-value logical row derived from one or more other key-value logical
rows may be derived based on end-user requisitions. In some embodiments, the
key-value logical row derived from one or more other key-value logical rows
may be derived based on requisitions by a data administrator of the data
storage system.
[0073] In accordance with one aspect, in response to a given data access
request
based on a given metadata descriptor, said digital data processor generates an
independent data set via said key-value store by accessing those key-value
logical rows
having metadata descriptors responsive to said data access request. In some
embodiments,
the given metadata descriptor may be pre-determined by the system
administrator or customized by the end-user. In some embodiments, the
responsive metadata descriptors may include metadata descriptors created at
the run time, at subsequent times when said key-value logical rows were
derived or formulated, or when a requisition based on the given metadata
descriptor is made.
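By way of illustration only, generating an independent data set from those key-value logical rows whose metadata descriptors are responsive to a data access request can be sketched as follows. This is a minimal sketch that omits access authorization and distribution across nodes; the function `build_dataset`, the tuple layout of a row, and the sample values are hypothetical.

```python
def build_dataset(rows, requested_descriptors):
    """Collect rows whose descriptor was requested, grouped by row key."""
    wanted = set(requested_descriptors)
    dataset = {}
    for key, descriptor, value in rows:
        if descriptor in wanted:
            dataset.setdefault(key, {})[descriptor] = value
    return dataset


# Each row is (key, metadata descriptor, constituent data component value).
rows = [
    ("doc-1", "file_name", "intake_form.pdf"),
    ("doc-1", "age_group", "30-39"),
    ("doc-1", "first_name", "John"),
    ("doc-2", "age_group", "40-49"),
]

# A request for age range data across all data objects in the store.
age_dataset = build_dataset(rows, ["age_group"])
```

The result is a new structure built from copies of the responsive values, so it can be returned, stored, or discarded without altering the rows it was generated from.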
[0074] In some embodiments, said key-value logical row comprises an access
authorization value for restricting access to the corresponding key-value
logical row. In
accordance with one aspect, the access authorization value may be stored
digital
information. In accordance with one aspect, the access authorization value may
be a
combination or combinations of constituent data component values and metadata
descriptors. In some embodiments, the access authorization value may be
employed for
generation of the independent data set, in response to a given data access
request,
allowing control over the information accessed and the independent data set
generation.
[0075] In some embodiments, examples of factors that may be associated
with the
access authorization include a requesting user identity, a requesting user
role, a
requesting user group, the constituent data component of the corresponding key-
value
logical row, the raw data sets from which the corresponding key-value logical
row
originated, and the metadata descriptor of the corresponding key-value logical
row. In
some embodiments, access authority would be determined at the time of the data
access
request, or at subsequent times, based on the above-noted factors.
[0076] In accordance with one aspect, said independent data set returned
in response
to said data access request is stored in the data storage system. In some
embodiments, the
independent data set returned is not stored in the data storage system thereafter.
[0077] In accordance with one aspect, at least some of the key-value
logical rows are
automatically generated from the raw data set upon importing such raw data set
into the
data storage system. Examples of the key-value logical rows that are
automatically
generated from the raw data set upon importing such raw data set into the data
storage
system include file name and file type. In some embodiments, some of the
derived key-
value logical rows are derived upon a request for such derivation by a user of
the data
storage system. Examples of the key-value logical rows that are derived upon a
request
for such derivation by a user of the data storage system include first name
and last name.
In some embodiments, an additional key-value logical row is derived by
obfuscating the constituent data component value of at least one existing
key-value logical row to generate the constituent data component value of the
additional key-value logical row, with the corresponding metadata descriptor
of the additional key-value logical row being generated based on said
obfuscating. Examples of obfuscating include deliberately rendering an age
obscure so as not to disclose the precise age, placing the obfuscated row with
the other key-value logical rows related to the same raw data as the mentioned
age key-value logical row, and making it available for data access requisition
with access authority for generation of an independent data set.
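By way of illustration only, deriving an additional key-value logical row by obfuscating an age can be sketched as follows. This is a minimal sketch; the five-year bucket width, the descriptor naming scheme, the visibility attributes, and the function `obfuscate_age` are hypothetical choices for the example rather than features of the disclosed system.

```python
def obfuscate_age(row, width=5, visibility="DE-IDENTIFIED"):
    """Derive a new row holding an age range instead of the precise age."""
    low = (int(row["value"]) // width) * width
    return {
        "key": row["key"],                     # same raw data as the source row
        "descriptor": f"age_range_{width}y",   # descriptor reflects the obfuscation
        "value": f"{low}-{low + width - 1}",
        "visibility": visibility,              # may differ from the source row
    }


# Existing row holding the precise age under a restrictive attribute.
age_row = {"key": "doc-1", "descriptor": "age", "value": "34", "visibility": "PHI"}

obfuscated_row = obfuscate_age(age_row)
```

The source row is left unchanged, so a requestor authorized only for the obfuscated row can still receive an age range in an independent data set.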
[0078] In accordance with one aspect, an additional key-value logical row
may be
derived by aggregating the constituent data component values of at least two
existing
key-value logical rows to generate the constituent data component value of the
additional
key-value logical row, and the corresponding metadata descriptor of the
additional key-value logical row being generated based on said aggregating.
Examples of aggregating include aggregating first name and last name to
formulate an additional key-value logical row, with the corresponding metadata
descriptor "name". In some embodiments, examples of aggregating include
aggregation of key-value logical rows related to a data object associated with
raw data.
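By way of illustration only, aggregating a first name row and a last name row into an additional row with the metadata descriptor "name" can be sketched as follows. This is a minimal sketch; the function `aggregate_name`, the dictionary layout, and the sample values are hypothetical.

```python
def aggregate_name(first_row, last_row):
    """Derive a "name" row from first name and last name rows of one object."""
    if first_row["key"] != last_row["key"]:
        raise ValueError("rows must belong to the same data object")
    return {
        "key": first_row["key"],
        "descriptor": "name",
        "value": f'{first_row["value"]} {last_row["value"]}',
    }


# Sample existing rows; the values are illustrative only.
first_row = {"key": "doc-1", "descriptor": "first_name", "value": "John"}
last_row = {"key": "doc-1", "descriptor": "last_name", "value": "Smith"}

name_row = aggregate_name(first_row, last_row)
```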
[0079] In accordance with one aspect, an additional key-value logical row
may be
derived through a function-based calculation based on the constituent data
component
values of at least one existing key-value logical row to generate the
constituent data
component value of the additional key-value logical row, and the corresponding
metadata
descriptor of the additional key-value logical row being generated based on
said function-based calculation. Examples of said function-based calculation
may include decision-making schemes, mathematical functions, and other rules
used to produce an additional key-value logical row and the corresponding
metadata based on existing key-value logical rows.
[0080] In accordance with one aspect, where an additional key-value
logical row is
derived by obfuscating the constituent data component value of at least one
existing key-
value logical row to generate the constituent data component value of the
additional key-
value logical row, and the corresponding metadata descriptor of the additional
key-value
logical row, by said obfuscating, the access authorization value of the
additionally
derived key-value logical row may be the same as the existing key-value
logical rows,
from which the additional key-value logical row was derived, or different. In
some
embodiments, the access authorization for the additional key-value logical
row may be pre-determined in association with one or more of the following: a
requesting user identity, a requesting user role, a requesting user group, the
constituent data component
of the
corresponding key-value logical row, the raw data sets from which the
corresponding
key-value logical row originated, and the metadata descriptor of the
corresponding key-
value logical row. In some embodiments, the access authorization for
the additional key-value logical row may be determined by the system
administrator or data access requestor.
[0081] In accordance with one aspect, there is disclosed a data storage
method for generating context-specific datasets based on raw data sets, the
method implemented on
a data storage system comprising a plurality of network-accessible hardware
storage
resources, each of said hardware storage resources being in network
communication and
configured for distributed storage of data objects, and a digital data
processor for
responding to data storage requests received over a network and relating to
said data
objects. In some embodiments, the context-specific datasets based on raw data
sets are
generated upon receipt of data access requests by end-users of the method.
Examples of
the data access requests may include specific requests for age range data for
all the data
objects in the data storage. Examples of network-accessible hardware storage
may
include spinning disks connected for distributed data storage. The method
comprises
storing a key-value store in one or more said hardware storage resources,
directly
generating at least one of the key-value logical rows for a given data object
from raw
data, deriving at least one of the key-value logical rows for the given data
object from
other key-value logical rows, and generating, in response to a data access
request based
on one or more metadata descriptors, an independent data set via said key-
value store by
accessing those key-value logical rows having metadata descriptors responsive
to said
data access request. In some embodiments, the key-value store comprises a
unique key-
value logical row for each constituent data component of each data object.
The constituent data components of each data object, with each data object
related to at least one raw data set,
may include information about the raw data, such as file name and file type,
information
derived from the raw data, such as first name and last name, and information
formulated
through aggregating, employing function-based calculations, or responding to
data access
requests. Each key-value logical row comprises a key for uniquely identifying
the key-
value logical row, a constituent data component value comprising stored
digital
information relating to the constituent data component associated with the key-
value
logical row, and a metadata descriptor describing metadata of a data component
value.
The key for unique identification may be stored digital information, which may
be a
combination or combinations of constituent data component values and metadata
descriptors describing metadata of a data component value. The constituent
data
component may be an actual value or a pointer to the location of the storage
where the
actual value is stored. The key, the constituent data component, and the
metadata descriptor may be created, derived, or formulated at the run time or
at subsequent times: in some embodiments pre-determined, in some embodiments
under data access requests, and in some embodiments by a system administrator.
In accordance with one aspect, at
least one of the key-value logical rows for the given data object may be
derived from
other key-value logical rows. The derivation may take place under
pre-determined requests, under data access requests, or by a system
administrator, either at run time or at subsequent times. In some embodiments,
a data access request is a request for
data, which
may be automatic, pre-determined, or user specific. For example, the data
access request
may be made by the end user or the system administrator. In another example,
the data
access request may be received at the run time or at subsequent times with the
data object
existent in the system.
[0082] In accordance with one aspect, there is disclosed a device for
generating context-specific datasets based on existing raw data sets, the
device being in network communication with a plurality of network-accessible
hardware storage resources, each
of said hardware storage resources being in network communication and
configured for
distributed storage of data objects. The device comprises a digital data
processor and a
network communication interface. In some embodiments, the digital data
processor
responds to data storage requests received over a network and relating to said
data
objects. In some embodiments, the network communication interface
communicatively interfaces one or more requesting users and a key-value store
stored on one or more of said hardware storage resources. The key-value store
is configured to store a unique key-value logical row for each constituent
data object component of each data object, each key-value logical row
comprising a key, a constituent data component value, and a metadata
descriptor. At least
one of the
key-value logical rows for a given data object is directly associated with raw
data and at
least one of the key-value logical rows of the given data object is derived
from one or
more other key-value logical rows. In response to a data access request based
on a given
metadata descriptor, the digital data processor generates an independent data
set from the
key-value store by accessing those key-value logical rows having metadata
descriptors
responsive to said data access request.
[0083] In one exemplary embodiment, there is provided a system that consists
of two manager nodes plus Hadoop-based cluster nodes, wherein each Hadoop-based
cluster node in this exemplary system may comprise computing devices that may
be classified as either or both Hadoop master nodes and Hadoop data nodes. It
should be noted that in other embodiments there may be one manager node or a
plurality; in either case, the master node functionalities described below may
be carried out by a single master node, or distributed in various manners
across the plurality, and the subject matter hereof is not limited to systems
with two manager nodes. The manager nodes may carry out the following
functions: running any centralized applications that manage the data storage
and access functions (including management of the key-value store); providing
the web and other (e.g. REST) interfaces for data administration, privacy,
security, and governance functions; hosting any web, proxy, or other server
functionality (e.g. NGINX); managing and running the master key distribution
for administrators or other service principals (e.g. MIT Kerberos Key
Distribution Center); running data analysis or data function applications or
libraries (e.g. the PHEMI Data Science Toolkit, Spark, and Zeppelin); managing
and running the slave key distribution for administrators or other service
principals (e.g. MIT Kerberos Key Distribution Center); and hosting backup
components for any other manager node in case of critical failure thereof.
[0084] Referring to Figure 1, there is shown a conceptual schematic of a
reference
configuration or architecture of the above-mentioned Example 1. In the
embodiment
shown 100, there are two Manager Nodes 110, 120 and a Hadoop-cluster 130.
Manager
Node 1 110, which may be run as a single tenant deployment (e.g. a bare-metal
deployment) and/or as a virtualized system (e.g. a VM on a cloud-based or
multi-tenant
server), comprises the following: a management component 112, which manages
and runs the data deployment, storage, and operational functions on the
digital data processor of Manager Node 1 110, and otherwise facilitates the
implementation of the methods disclosed herein, using, for example, the PHEMI
Central™ software; a web server, proxy server, and/or load balancer
functionality component 112, which may include NGINX;
and a master key distribution and management component 113, which may include
MIT
Kerberos, for example. The second management node in this example, Manager
Node 2
120, comprises: a supplemental management node 122 comprising a digital data
processor for customized data analysis as well as supplemental or
complementary (e.g. for compute or communicative load balancing) management
and running of the data deployment, storage, and operational functions
(including, in some cases, to take instructions from or to work with Manager
Node 1 110); a cluster-computing
management function node 121 which implements management of distributed and/or
clustered data storage resources, and may include an interface for programming
data
clusters and providing, for example, fault tolerance, redundancy, and
parallelism (and
may not be limited to Hadoop-based clusters and HDFS systems, but may interact
with
other distributed storage systems, including MapR-FS, Cassandra, OpenStack
Swift, Amazon S3, Kudu, or a custom solution or file system); and other
functional nodes
123 for implementing pre-existing or customized functional tools (e.g.
Zeppelin for data
analysis of large data sets). In some cases, the supplemental management node
122 is
used to generate data collections and datasets from stored data that can be
loaded into
formats compatible with, or facilitated for, communication, other application
layer
functionalities, or analysis tools (e.g. Spark DataFrames). As noted above,
these nodes
can be distributed across one or more master nodes in different combinations.
[0085] Further referring to Figure 1, there is conceptually shown a
Hadoop-based cluster 130. The cluster 130 will comprise Hadoop master nodes
(or more generally in Hadoop and non-Hadoop examples, a Master Node) and
Hadoop data nodes (or more generally in Hadoop and non-Hadoop examples, a Data
Node). In general, the master nodes will comprise specially-programmed
networked computing devices that provide distributed task management and
process orchestration amongst the data nodes. This task management and process
orchestration may include data compute functional modules 132 and data
management functional modules 133. The data management functional modules 133
may include, for example, MapReduce and YARN, but may also include other
programming interfaces and systems for managing and scheduling computing
resources across storage resources. Some embodiments may provide for
database-related functionality, including across distributed data nodes, such
as that provided by using
database management components 133 which may implement, for example, MongoDB.
Such database management component 133 services implement management
instructions
from the data processor management nodes 112, 122 (e.g. the PHEMI Central
configuration information). In embodiments using MongoDB, the MongoDB service
runs
in a phemi mongo container (or other contained or virtualized implementation,
e.g. jail,
VM) on all master nodes, running as a multi-member replica set. The data nodes
will in
general be the "workhorses" of a Hadoop cluster, where data is primarily
stored and
processed. Data nodes may in some embodiments consist primarily of resources
having
multiple attached disk drives for local storage and access to cluster data. In
Figure 1, the
storage functional modules 131 are implemented across the data nodes using HDFS with
Accumulo.
[0086] In Example 1, referred to above, the PHEMI Central software uses
the
Security-Enhanced Linux (SELinux) implementation of Red Hat Enterprise Linux
7.3
operating system. The PHEMI Central application includes the following
components
running on Manager Node: (i) PHEMI Central: the PHEMI Central application runs
as
the PHEMI Agile service which runs in the phemi_central Docker container on
Manager
Node 1 (for resilience, the container and service are also provisioned on
Manager Node
2); (ii) NGINX: the NGINX service manages, redirects, and filters network
traffic to the
correct endpoints, and runs in the phemi_nginx container on Manager Node
1 (for
resilience, the container and service are also provisioned on Manager Node 2);
(iii)
Kerberos Key Distribution Centre (or Kerberos KDC): PHEMI Central requires a
Kerberos KDC in the enterprise Active Directory to manage principals and key
distribution for end users. In addition, PHEMI Central hosts an MIT Kerberos
KDC
server to store principals and distribute keys for system services. The local
Kerberos
KDC is hosted on Manager Node 1, with a second KDC configured on Manager Node
2
for high availability. The PHEMI Central internal KDC operates in a
relationship of
cross-realm trust with Active Directory's KDC. In this exemplary embodiment,
the
PHEMI Central application also includes a Dockerized component running across
master
nodes, running MongoDB in coordination with the containers on Manager Node 1.
[0087] Referring to Figure 2, there is shown a schematic of one embodiment of a
system in accordance with the present disclosure. In accordance with this exemplary
embodiment there are shown two manager nodes (mgr01 and mgr02) 210A and 210B,
three master nodes (mas01 through mas03) 215A, 215B, and 215C, and four data
nodes
(dn001 through dn004) 280A, 280B, 280C, and 280D. System drives and other non-data
drives (not shown) that provide storage for non-data-node operational purposes (e.g.
storage used by manager or master nodes) are either redundant or RAIDed,
depending on
the deployment. Figure 2 shows Cluster nodes having restricted access to the
following: a
GbE network interconnect 250; DNS 260 and NTP 270 functionality; Cloudera Key
Trustee Server and Key Trustee KMS key management entities 240 (or other key
management services); a Kerberos KDC 230 comprising services implementing a ticket-granting
server and an authentication server, or services implementing instructions
therefrom; and an Active Directory/LDAP server 220 for user authentication. In
some cases, a
secure network connection should be used from the hosted machines to the
customer
network or end-user devices.
[0088] In some embodiments, HDFS (or other big data/distributed file
system)
utilizes replication for fault-tolerance, fault-recovery, or process
efficiency, with every
block of data automatically replicated on multiple data nodes. A duplicate can
be used as
back-up, or different copies can be used as the "live" copy for data requests
depending on
resource availability and performance (although the latter purpose further
requires
updating across all duplicates in a different manner than when using
duplicates as back-
up). In addition, some embodiments may make a number of services available
across a
cluster. Services may be deployed across manager nodes, master nodes, and data
nodes
and therefore resiliency of services (including, for example, Hadoop services and services
specific to PHEMI Central, such as the PHEMI Central application and PHEMI
Raindrop, NGINX, MongoDB, and others), as well as encryption keys, may be provided
through redundant provisioning and failover scripts. In Hadoop-based
embodiments,
HDFS in general triplicates data by default, with each block of a file or of data
replicated on three different machines. This means each block of data can be recovered
with N + 2 redundancy. Different redundancy can be used. In some embodiments,
clusters are protected from the customer data center and external environment
using
firewall rules. In general, however, there are no firewall rules within the
cluster. Each
PHEMI cluster node has unrestricted access to every other PHEMI cluster node,
although
in some embodiments an intra-cluster firewall may be implemented.
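The effect of block replication on capacity and fault tolerance can be illustrated with a short sketch. It assumes the default replication factor of three described above; the function names are hypothetical and are not part of HDFS or PHEMI Central.

```python
def usable_capacity_tb(raw_tb: float, replication: int = 3) -> float:
    """Capacity left for unique data when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_tb / replication


def tolerated_failures(replication: int = 3) -> int:
    """Number of replica-holding machines that can be lost while one copy of a block survives."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return replication - 1
```

Under these assumptions, three-way replication leaves one third of the raw disk space for unique data, and each block survives the loss of two of the machines holding it, which corresponds to the N + 2 redundancy mentioned above.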
[0089] Referring again to Figure 2, each exemplary Manager Node 210A and
210B
comprises the following hardware (although other similar arrangements are
possible):
dual power supplies; 2 x Intel Xeon v4 8-core, 2.5 GHz or better; 128 GB RAM,
in 16
GB DIMM increments; OS: 2 x 120 GB SAS/SSD disks, RAID-1 configured;
/var/logs:
500 GB disk; and /var/data/phemi: 250 GB disk. Each shown master node 215A,
215B,
215C comprises the following hardware (although other similar arrangements are
possible): Dual power supplies; 2 x Intel Xeon v4 8-core, 2.5 GHz or better;
128 GB
RAM, in 16 GB DIMM increments; OS: 2 x 120 GB SAS/SSD disks, RAID-1
configured; and /var/logs: 500 GB disk. The shown four data nodes 280A-D
should be
deployed with a RAID-10 configuration, each having a total size of 2 TB of the
following
types (285A-D): /var/data/phemi: 500 GB disk; NameNode: 500 GB disk (28;
JournalNode: 500 GB disk; and Zookeeper: 500 GB disk. Each of the disks 285A-D
may
comprise spinning hard drives, flash, SSDs, or other types of data storage
media.
[0090] In some systems, the ratio of master nodes to data nodes may be
balanced in
order to balance data storage and compute functions. In many exemplary
configurations,
these concerns are balanced. However, in other embodiments data storage nodes
having
more data storage resources (e.g. additional data drives) may be used as the
amount of
data increases to create a more storage-intensive system. On the other hand,
data nodes having more RAM or more powerful data processing components may be
used for more compute-intensive systems or applications. In some embodiments,
where
data nodes are virtualized, VMs may be apportioned on the fly with more
storage, for
storage-intensive applications, or more RAM for more compute-intensive
applications. In
a balanced compute configuration, balanced CPU, memory, and storage may be
preferred. Balanced compute may be preferable for the following applications:
Dataset
manufacture; Ingest of complex file types at modest rates; Data science with
fewer than 5
concurrent users; Proof-of-concept or pilot deployments where load profiles
are not well
understood. In an exemplary embodiment of a balanced storage-compute
configuration,
four data nodes could be exposed to offer 12 TB of usable space, with a
compute capacity
of 64 cores and 512 GB RAM across the cluster. In embodiments, a storage-intensive
configuration may be preferred and may differ from the balanced compute option
by
using larger chassis on the data nodes, which can thereby accommodate greater
numbers
of storage disks. This option may be preferred for storage-heavy workloads
such as:
Document and data archives; ETL offloading; Genomic BAM/FASTQ files; Images;
and
Data and files that are rarely accessed. In an exemplary embodiment of a
storage-
intensive configuration, four data nodes could be exposed to offer 24 TB of
usable space,
with a compute capacity of 64 cores and 512 GB RAM across the cluster. In
embodiments, a compute-intensive configuration may be preferred and, in
general, would
differ from the balanced compute option by having more RAM on the data node.
This
option may be preferred for compute-heavy workloads such as: Heavy data
science
workloads; 5+ concurrent data science users; Complex file types with high
streaming
ingest rates; or Workloads with high real-time or interactive components. In
an
exemplary embodiment of a compute-intensive configuration, the four data nodes
could
expose 12 TB of usable space, with a compute capacity of 64 cores and 1 TB RAM
across the cluster.
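The three exemplary four-data-node configurations described above can be summarized in a small lookup, shown here as a sketch only. The capacity figures are those given in the text (with 1 TB of RAM written as 1024 GB, which is an assumption); the `ClusterProfile` type and the `choose_profile` selection rule are hypothetical simplifications of the workload guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterProfile:
    name: str
    usable_tb: int   # usable storage across the four data nodes
    cores: int       # compute cores across the cluster
    ram_gb: int      # RAM across the cluster


PROFILES = {
    "balanced": ClusterProfile("balanced", usable_tb=12, cores=64, ram_gb=512),
    "storage": ClusterProfile("storage-intensive", usable_tb=24, cores=64, ram_gb=512),
    # 1 TB RAM is expressed as 1024 GB here (assumption).
    "compute": ClusterProfile("compute-intensive", usable_tb=12, cores=64, ram_gb=1024),
}


def choose_profile(storage_heavy: bool, concurrent_data_scientists: int) -> ClusterProfile:
    """Hypothetical rule of thumb: 5+ concurrent data science users suggests
    compute-intensive, archive-style workloads suggest storage-intensive,
    and everything else falls back to the balanced configuration."""
    if concurrent_data_scientists >= 5:
        return PROFILES["compute"]
    if storage_heavy:
        return PROFILES["storage"]
    return PROFILES["balanced"]
```

A pilot deployment with three data science users and no archive workload would, under this rule, be given the balanced profile.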
[0091] Referring to Figure 3, an exemplary key-value schema is shown in
310. It is
represented with key-value pair parts following a base representation, such as
the
Accumulo framework, and also may be customized to a specific user's
implementation
within embodiments disclosed herein. The aforementioned exemplary schema can
be
found in 310. In one embodiment, the detailed description of the key-value
pair parts can
be found in 320.
[0092] Figures 4 and 5 show schematics for how individual data components
are
generated in logical rows in a given key-value store. In particular, Figure 4
shows how
the data asset 420 (i.e. all logical rows for a data object) is developed
based on
governance and contextual information stored in a data repository and/or
provided by the
manager and/or master nodes (as shown collectively as 410). Each row, in this example,
comprises: a Collection ID 420D, which identifies all logical rows in a data asset, data
object, or collection; a Row ID 420E, which uniquely identifies each logical row; St_n
420A, which provides an indicator of security, sensitivity or authorization and where n
denotes the logical row number; Ts_n 420B, which provides a time stamp and where n
denotes the logical row number; Descr_n 420E, which denotes a descriptor of the value; and
Val_n 420C, which provides a value (which may be acquired directly from raw data or
derived using DPF functions 430). In Figure 5, a similar process is shown,
except
additional functions or outputs 510 are applied to create additional logical
rows that are
based on raw data and other derived data.
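The logical row described above can be sketched as a simple record type. This is an illustration only: the field names, the `derive_row` helper, and the sample values are hypothetical, and the lambda stands in for a DPF function that derives a new value from raw data.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class LogicalRow:
    collection_id: str   # Collection ID: shared by all logical rows of one data asset
    row_id: str          # Row ID: unique per logical row
    security_tag: str    # St: security, sensitivity, or authorization indicator
    timestamp: int       # Ts: time stamp
    descriptor: str      # Descr: describes the value
    value: Any           # Val: raw or derived value


def derive_row(source: LogicalRow, row_id: str, descriptor: str,
               security_tag: str, timestamp: int,
               fn: Callable[[Any], Any]) -> LogicalRow:
    """Create an additional logical row in the same collection from an existing row."""
    return LogicalRow(
        collection_id=source.collection_id,
        row_id=row_id,
        security_tag=security_tag,
        timestamp=timestamp,
        descriptor=descriptor,
        value=fn(source.value),
    )


# Made-up sample data: a raw row and a less sensitive row derived from it.
raw = LogicalRow("asset-1", "row-1", "identifying", 1511222400,
                 "date_of_birth", "1980-05-17")
derived = derive_row(raw, "row-2", "birth_year", "de-identified",
                     1511222400, lambda v: int(v[:4]))
```

The derived row keeps the Collection ID of its source, so both rows remain part of the same data asset while carrying different security tags.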
[0093] Referring to Figure 6, there is shown a schematic representation of the
derivation of a context-specific dataset. Data from each of the datasets 420 can be
accessed and, based on specific context and security (or sensitivity) tags, a
new dataset
610 can be generated, and in some cases stored, for use by a given user or
class of users.
Users (not shown) are given access to a dataset, although the dataset 610 can
be dynamic
in that a change to a security/sensitivity tag, or a governance requirement,
may
automatically cause the creation of a new dataset 610 or trigger a requirement
that the
user request a new dataset 610.
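A minimal sketch of such a derivation follows, assuming each logical row is represented as a dictionary with a security tag and a descriptor. The row layout, the tag names, and the `derive_dataset` function are hypothetical; the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, object]

# Made-up logical rows belonging to one collection.
ROWS: List[Row] = [
    {"collection": "c1", "row": "r1", "tag": "identifying", "descr": "name", "value": "A. Patient"},
    {"collection": "c1", "row": "r2", "tag": "de-identified", "descr": "birth_year", "value": 1980},
    {"collection": "c1", "row": "r3", "tag": "de-identified", "descr": "diagnosis", "value": "J45"},
]


def derive_dataset(rows: Iterable[Row], allowed_tags: Set[str],
                   wanted_descriptors: Set[str]) -> List[Row]:
    """Keep only rows whose security tag is permitted for the user's context
    and whose descriptor is relevant to that context."""
    return [r for r in rows
            if r["tag"] in allowed_tags and r["descr"] in wanted_descriptors]


# A researcher context that may only see de-identified rows.
researcher_view = derive_dataset(ROWS, {"de-identified"},
                                 {"name", "birth_year", "diagnosis"})
```

Because the dataset is computed from the tags, re-running `derive_dataset` after a tag or governance change yields a new dataset, which is one way the dynamic behaviour described above could be realized.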
[0094] In accordance with one aspect, there is disclosed a computer-
readable
medium, having stored thereon instructions for execution by a computing device
in
network communication with a data storage system comprising a plurality of
data storage
components, each of said data storage components being in network
communication, and
configured for distributed storage of a plurality of data objects, each said
data object
comprising a plurality of constituent data object components, the
instructions
executable to automatically implement the steps of the methods described
herein.
[0095] In accordance with one aspect, there are provided methods, systems,
and
devices that assess the risk of re-identification of a given dataset or
collection of data.
Such dataset or collection of data may include a dataset derived in accordance
with
methods disclosed herein, or a collection of rows from a key-value store. In
some
embodiments, the risk of re-identification of a dataset or collection may be
assessed by
determining the likelihood or probability that a given set, row, or value can
be correlated
to an identifiable individual or subject. In some embodiments, a given derived
dataset can
be associated with a risk of re-identification, wherein such a risk provides
an indication of
a probability that any given data object within the key value store that is
made part of a
derived dataset can be associated with an identifiable individual or subject
to which the
data object pertains. The higher such probability, the greater the re-identification risk
indication. This risk indication may also be increased depending on the nature
of the data
object; for example, if the data object comprises sensitive personal
information, such as
but not limited to personal health or personal financial information. In
general, the risk of
re-identification will decrease if personally identifying information can be
withheld from
a dataset or obfuscated within a dataset. To the extent that this does not
impact the
informational value of a dataset, or minimally impacts the informational value of
a dataset,
the re-identification risk can be used to optimally provide informational
value while
protecting the identity of the subjects of the information within the dataset.
[0096] In some such embodiments, the re-identification risk is a
measurement of
the likelihood that any data object or data component thereof, or collection
thereof, can
be linked or associated with the subject or subjects to which it pertains. The
number of
same or similar data components within a dataset or other collection (that may
or may not
refer to other subjects) can be used to provide such an assessment of re-
identification risk.
In some embodiments, the assessment can provide the k-anonymity property of a
given
data set, although other methods of assessing re-identification risk that may
be known to
persons skilled in the art can be used, including t-closeness, l-diversity, and differential
privacy. k-anonymity is a property of a given datum, or set of data (including one or
more rows), indicating that such datum or set of data cannot be distinguished from at least k-1
corresponding data or sets of data; an assessment of k-anonymity may be
applied in
respect of a particular field or type of metadata in a dataset. The k-
anonymity property of
data is described in Samarati, Pierangela; Sweeney, Latanya (1998).
"Protecting privacy
when disclosing information: k-anonymity and its enforcement through
generalization
and suppression", Harvard Data Privacy Lab, which is incorporated by reference
herein.
t-closeness, l-diversity, and differential privacy utilize statistical models
to provide an
indication of similarity between a given data component within a dataset that
is used to
calculate a risk of re-identification. See Ninghui Li, Tiancheng Li, and
Suresh
Venkatasubramanian (2007). "t-Closeness: Privacy beyond k-anonymity and l-diversity",
ICDE, Purdue University; and Dwork, Cynthia (2006). "Differential Privacy"
ICALP'06
Proceedings of the 33rd international conference on Automata, Languages and
Programming - Volume Part II, Pages 1-12, which is incorporated by reference
herein. In
some embodiments, a risk of re-identification is assessed for a given data set
and an
acceptable threshold may be applied for a given dataset and/or in respect of a
particular
field or type of metadata within such dataset. For example, for a dataset
comprising
personal health information ("PHI") and non-PHI, a re-identification risk in
respect of the
PHI data may be provided for the dataset, as well as another re-identification
risk in
respect of the non-PHI data may be provided. In another example, for any value
that is, or
any data set that includes PHI, or other sensitive information (e.g. personal
financial,
insurance, or other sensitive information), different acceptable threshold
risks of re-
identification may be applicable than for datasets that do not include PHI.
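As an illustration of the k-anonymity assessment described above, the following sketch groups records by a chosen set of fields and reports the size of the smallest group. The function names are hypothetical, and treating 1/k as the worst-case re-identification risk is a common convention assumed here, not something specified in this disclosure.

```python
from collections import Counter
from typing import Dict, Sequence

Record = Dict[str, object]


def k_anonymity(records: Sequence[Record], quasi_identifiers: Sequence[str]) -> int:
    """Size of the smallest group of records that share the same values for
    the given fields; 0 for an empty dataset. Field values must be hashable."""
    if not records:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def worst_case_risk(records: Sequence[Record], quasi_identifiers: Sequence[str]) -> float:
    """Assumed risk measure: 1/k for the least protected record."""
    k = k_anonymity(records, quasi_identifiers)
    return 1.0 / k if k else 0.0


# Made-up sample records.
SAMPLE = [
    {"age": 34, "region": "V6B"},
    {"age": 34, "region": "V6B"},
    {"age": 35, "region": "V6B"},
]
```

With these sample records, grouping on `region` alone gives k = 3, while grouping on both `age` and `region` gives k = 1, because the 35-year-old is unique. This shows how the assessment can be applied in respect of a particular field or set of fields.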
[0097] In embodiments, upon generating a derived dataset, a risk of re-identification
is determined for said dataset. In other embodiments, the re-identification
risk may be
determined thereafter. Depending on the determined risk, as well as other
factors, the
dataset may be made available to particular users. This availability may be a
function of
sensitivity of values in the dataset (e.g. whether it contains PHI or personal
financial
information ("PFI")), or the risk of re-identification, or the role or trust-
level of the
person/entity to whom the dataset is being made available (e.g. physician,
researcher,
bank teller, etc.), or the nature of the availability (e.g., transmission of a
new dataset or
access to a centralized repository), or the location of the user (e.g. remote
laptop, remote
server, server room, etc.), or a combination thereof.
[0098] In some embodiments, the re-identification risk may be associated
with the
concept of zones of trust, or location-based de-identification controls. In
general, when
datasets are de-identified, the dataset is then sent to (or made available to)
approved
targets without reference to the location of the target or the security
features/risks
associated with such a target's location. This may expose a potential risk of
re-
identification. In embodiments, there may be determined a Risk Acceptability
Threshold
(RAT) based on a determination of the specific risks associated with the
circumstances,
such circumstances including the dataset risk or sensitivity (which relates to
one or both
of a re-identification risk and/or the sensitivity of such data), an
indication of user trust
(relating to a level of authorization or trust associated with a given user or
entity in
association with, in some embodiments, a sensitivity or sensitivities of the
data set), and a
location-based and/or security-based risk assessment of the computing devices
to where
the data set is to be provided (which may include associated or intermediary
computing
devices; e.g. if a computing device is highly secure, but the data set must be transmitted or
conveyed thereto via less secure intermediary devices, this may be taken into
consideration in some embodiments). For example, RAT may be determined as
Max(Dataset risk, User trust, Location controls). An exemplary process in accordance
with embodiments hereof may include: (1) optionally first determining an RAT
associated with a particular collection of data; (2) applying de-identification or obfuscation
to specific fields in accordance with methods disclosed hereunder to generate a de-identified
dataset; (3) calculating the risk for each record (e.g. data component) in the
dataset using a re-identification risk calculation algorithm (e.g. k-anonymity
determination algorithm); (4) applying a filter to the data to meet the Risk Acceptability
Threshold; and (5) restricting the dataset destination to only those targets that meet the Risk
Acceptability Threshold. The location-control indication may be a pre-determined value
associated with specific types of locations, or it may be determined in an ad
hoc manner
based on access or security characteristics associated with a specific
location. For
example, if a given dataset is associated with a 10% RAT, the dataset could be
restricted
to locations that meet the necessary location-control indication. In such an
example,
PHEMI Central may restrict target locations such that a dataset with a 10% RAT can only be sent to a
secure research environment and not, for example, downloaded to a user's laptop.
By contrast, another dataset that is de-identified to a 1% RAT may
then be downloaded to a user's laptop. In some embodiments, the location-control
indication may be associated with a "zone of trust" which, possibly based on its
security and/or the ability of third parties to access it, may allow for the provision of more
sensitive or risky data sets. Such zones of trust may be determined in advance
or
dynamically depending on criteria relating to security or indications of such
security;
either such case, whether pre-determined or dynamically determined based on
criteria
and/or circumstances, would constitute a designated zone of trust.
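Steps (1) and (3) to (5) of the exemplary process can be sketched as follows. This is an assumption-laden illustration: the text does not specify the scale of the three RAT inputs, so they are treated here as risk scores between 0 and 1; the per-record risk is taken as one divided by the number of records sharing the same field values; and the location limits simply mirror the 10% and 1% example above. All function and variable names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

Record = Dict[str, object]


def risk_acceptability_threshold(dataset_risk: float, user_trust: float,
                                 location_controls: float) -> float:
    """Step (1): RAT as the maximum of the three inputs, as in the formula in the text."""
    return max(dataset_risk, user_trust, location_controls)


def record_risks(records: Sequence[Record], fields: Sequence[str]) -> List[float]:
    """Step (3): assumed per-record risk, 1 / (size of the record's group)."""
    groups = Counter(tuple(r[f] for f in fields) for r in records)
    return [1.0 / groups[tuple(r[f] for f in fields)] for r in records]


def filter_to_threshold(records: Sequence[Record], fields: Sequence[str],
                        rat: float) -> List[Record]:
    """Step (4): drop records whose risk exceeds the threshold."""
    risks = record_risks(records, fields)
    return [r for r, risk in zip(records, risks) if risk <= rat]


def allowed_destinations(rat: float, location_limits: Dict[str, float]) -> List[str]:
    """Step (5): locations permitted to receive a dataset with this RAT."""
    return sorted(name for name, limit in location_limits.items() if rat <= limit)


# Hypothetical location-control indications mirroring the example in the text.
LOCATION_LIMITS = {"secure_research_environment": 0.10, "user_laptop": 0.01}
```

With these limits, a dataset with a 10% RAT is restricted to the secure research environment, while a dataset de-identified to a 1% RAT may also go to a user's laptop. Because all records in a group share the same risk, the filter removes or keeps whole groups, so the risks of the remaining records are unchanged by filtering.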
[0099] In some embodiments, there are provided systems and methods for
dynamically deriving additional data components associated with an existing
dataset that
modify the re-identification risk. For example, if a given dataset includes
data
components that present a re-identification risk (as indicated by a k-anonymity property or other
re-identification risk determination) that is too high for release to, or use by, a given user or at
a user location,
additional data components may be derived for a different dataset that, while
relating to
the same data objects, increase the k-anonymity score. This might include
replacing all
data components appearing within the data set that include an age, with a data
component
that uses a date range. While this may minimally reduce the informational
effectiveness
for a researcher, for example, it may nevertheless reduce the re-
identification risk
significantly as the number of same or similar rows will be increased. In some
embodiments, the possible users, locations, and/or user-location combinations
that can
access or have the dataset delivered thereto will be increased. Since there is
a metric (e.g.
RAT) applied to dataset risk, user trust, and location-risk, the system can
automatically
derive further obfuscated data components for generating new datasets. In some
embodiments, the user can indicate which fields should be preferentially
obfuscated (or
further obfuscated) so as to minimally impact informational effectiveness.
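The effect of replacing an exact age with a range can be shown with a small sketch. It assumes ten-year age bands as the generalization (the text refers to a range without fixing its width); the function names and sample ages are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def to_age_band(age: int, width: int = 10) -> str:
    """Generalize an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def smallest_group(values: Iterable[Hashable]) -> int:
    """k for a single field: the count of the rarest value."""
    return min(Counter(values).values())


# Made-up ages: every exact age is unique, so k = 1 before generalization.
ages = [31, 34, 38, 42, 45, 47]
bands = [to_age_band(a) for a in ages]
```

Here every exact age is unique (k = 1), whereas the derived bands fall into two groups of three (k = 3), so the derived data component raises the k-anonymity of the dataset at the cost of some precision.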
[00100] In some embodiments, selectively fulfilling a data request means that
a
request may or may not be fulfilled. The request may be fulfilled in some
embodiments,
for example, when a risk of re-identification, as indicated by the re-
identification risk
value associated with a data request, is lower than would be required under
the
circumstances. Such circumstances may include, but are not limited to: the types
of
sensitivity (which may be referred to in some cases as an authorization level)
associated
with the data being returned in response to a data request; whether or not the
request has
originated from, or the data is being provided to or accessed from, a
designated zone of
trust; and/or the identity, role or other characteristic of the individual or
entity making the
data request. Notably, selectively fulfilling includes circumstances where the
context-
specific data set may not be provided. In such cases, some but certainly not
all
embodiments may result in further actions, such as but not limited to
dynamically
creating new data sets based on other key-value logical rows that have been
further
obfuscated, dynamically creating new but further obfuscated key-value logical
rows, or
limiting distribution to (or access from) certain types of designated zones of
trust.
[00101] While the present disclosure describes various embodiments for
illustrative
purposes, such description is not intended to be limited to such embodiments.
On the
contrary, the applicant's teachings described and illustrated herein encompass
various
alternatives, modifications, and equivalents, without departing from the
embodiments, the
general scope of which is defined in the appended claims. Except to the extent
necessary
or inherent in the processes themselves, no particular order to steps or
stages of methods
or processes described in this disclosure is intended or implied. In many
cases the order
of process steps may be varied without changing the purpose, effect, or import
of the
methods described.
[00102] Information as herein shown and described in detail is fully capable
of
attaining the above-described object of the present disclosure, the presently
preferred
embodiment of the present disclosure, and is, thus, representative of the
subject matter,
which is broadly contemplated by the present disclosure. The scope of the
present
disclosure fully encompasses other embodiments which may become apparent to
those
skilled in the art, and is to be limited, accordingly, by nothing other than
the appended
claims, wherein any reference to an element being made in the singular is not
intended
to mean "one and only one" unless explicitly so stated, but rather "one or
more." All
structural and functional equivalents to the elements of the above described
preferred
embodiment and additional embodiments as regarded by those of ordinary skill
in the art
are hereby expressly incorporated by reference and are intended to be
encompassed by
the present claims. Moreover, no requirement exists for a system or method to
address
each and every problem sought to be resolved by the present disclosure, for
such to be
encompassed by the present claims. Furthermore, no element, component, or
method
step in the present disclosure is intended to be dedicated to the public
regardless of
whether the element, component, or method step is explicitly recited in the
claims.
However, various changes and modifications in form, material, work-piece, and
fabrication material detail that may be made, without departing from the spirit and
scope of the
present disclosure, as set forth in the appended claims, as may be apparent to
those of
ordinary skill in the art, are also encompassed by the disclosure.
[00103] While the present disclosure describes various exemplary embodiments,
the
disclosure is not so limited. To the contrary, the disclosure is intended to
cover various
modifications and equivalent arrangements included within the general scope of
the
present disclosure.

Administrative Status

Event History

Description Date
Deemed Abandoned - Failure to Respond to Maintenance Fee Notice 2024-05-21
Inactive: Office letter 2024-04-11
Inactive: Office letter 2024-04-11
Revocation of Agent Requirements Determined Compliant 2024-04-04
Revocation of Agent Request 2024-04-04
Letter Sent 2023-11-21
Letter Sent 2022-12-06
All Requirements for Examination Determined Compliant 2022-09-27
Request for Examination Requirements Determined Compliant 2022-09-27
Change of Address or Method of Correspondence Request Received 2022-09-27
Request for Examination Received 2022-09-27
Inactive: Recording certificate (Transfer) 2022-02-18
Inactive: Multiple transfers 2022-01-27
Inactive: Office letter 2022-01-21
Common Representative Appointed 2020-11-07
Common Representative Appointed 2019-10-30
Common Representative Appointed 2019-10-30
Inactive: Cover page published 2019-06-11
Inactive: First IPC assigned 2019-06-10
Inactive: IPC assigned 2019-06-10
Application Published (Open to Public Inspection) 2019-05-21
Inactive: Office letter 2019-01-15
Inactive: Delete abandonment 2019-01-15
Inactive: IPC expired 2019-01-01
Inactive: IPC expired 2019-01-01
Inactive: IPC removed 2018-12-31
Inactive: IPC removed 2018-12-31
Inactive: Abandoned - No reply to s.37 Rules requisition 2018-11-21
Letter Sent 2018-05-11
Inactive: Single transfer 2018-05-01
Inactive: Reply to s.37 Rules - Non-PCT 2018-05-01
Inactive: IPC assigned 2018-02-07
Inactive: First IPC assigned 2018-02-07
Inactive: IPC assigned 2018-02-07
Inactive: Filing certificate - No RFE (bilingual) 2017-11-30
Inactive: Request under s.37 Rules - Non-PCT 2017-11-28
Application Received - Regular National 2017-11-28

Abandonment History

Abandonment Date Reason Reinstatement Date
2024-05-21

Maintenance Fee

The last payment was received on 2022-10-07

Fee History

Fee Type Anniversary Year Due Date Paid Date
Application fee - standard 2017-11-21
Registration of a document 2018-05-01
MF (application, 2nd anniv.) - standard 02 2019-11-21 2019-11-06
MF (application, 3rd anniv.) - standard 03 2020-11-23 2020-09-24
MF (application, 4th anniv.) - standard 04 2021-11-22 2021-11-17
Registration of a document 2022-01-27
Request for examination - standard 2022-11-21 2022-09-27
MF (application, 5th anniv.) - standard 05 2022-11-21 2022-10-07
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
FUSEFORWARD TECHNOLOGY SOLUTIONS LIMITED
Past Owners on Record
JOSEF ROEHRL
RUSS WEEKS
TIM TO
TRISTEN GEORGIOU
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents

Document Description   Date (yyyy-mm-dd)   Number of pages   Size of Image (KB)
Description 2017-11-20 42 2,171
Abstract 2017-11-20 1 26
Claims 2017-11-20 9 359
Drawings 2017-11-20 6 186
Representative drawing 2019-06-10 1 31
Cover Page 2019-06-10 2 75
Courtesy - Abandonment Letter (Maintenance Fee) 2024-07-01 1 544
Change of agent 2024-04-03 4 114
Courtesy - Office Letter 2024-04-10 2 226
Courtesy - Office Letter 2024-04-10 2 225
Filing Certificate 2017-11-29 1 201
Courtesy - Certificate of registration (related document(s)) 2018-05-10 1 103
Reminder of maintenance fee due 2019-07-22 1 111
Courtesy - Acknowledgement of Request for Examination 2022-12-05 1 431
Commissioner's Notice - Maintenance Fee for a Patent Application Not Paid 2024-01-01 1 552
Request Under Section 37 2017-11-27 1 57
Response to section 37 2018-04-30 4 120
Courtesy - Office Letter 2019-01-14 1 48
Maintenance fee payment 2019-11-05 1 27
Maintenance fee payment 2021-11-16 1 27
Courtesy - Office Letter 2022-01-20 2 48
Request for examination 2022-09-26 3 122
Change to the Method of Correspondence 2022-09-26 2 58