METHOD THAT REDACTS ZONES OF INTEREST IN AN AUDIO FILE USING
COMPUTER VISION AND MACHINE LEARNING
INVENTORS: DE LA FUENTE SANCHEZ, Alfonso Fabian
CABRERA VARGAS, Dany Alejandro
BACKGROUND
Law enforcement agencies around the globe are using body cameras for their field agents and security cameras inside buildings. Before releasing a video captured by those cameras to the public or to other entities, they need to protect the identities of victims, suspects, witnesses, or informants. There is a need for a software suite that can redact faces, license plates, addresses, voices, metadata, and more, in order to respond to FOIA (USA), GDPR (Europe), or Access to Information Act (Canada) requests in a timely manner without compromising anyone's identity. Law enforcement agencies around the globe need to redact and enhance videos. Manual video redaction and enhancement, even for a short video, can be difficult and take hours of work.
SUMMARY
In general, in one aspect, the invention relates to a method to replace zones of interest with silenced audio, specifically by matching the lip movements of a person to the available audio using a trained machine learning system and video redaction tools. In a different embodiment of the invention, the audio is processed and analysed using speech recognition to identify words or phrases of interest or to identify unrecognized audio, then proceeding to redact the audio accordingly.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a typical structure of a CNN model for human face detection.
FIG. 2A is a side view of a person and a frontal camera.
FIG. 2B is a side view of pedestrians with an overhead security camera.
FIGS. 3A to 3C show illustrative steps for one embodiment of the invention.
FIG. 4A is a flowchart that shows how one embodiment of the invention works.
FIG. 4B shows a frontal picture used in one embodiment of the invention.
FIG. 5A shows a second embodiment of the invention using overhead camera
footage.
FIG. 5B shows a different embodiment of the invention.
FIG. 6A shows a flowchart of the method for blurring.
FIG. 6B shows the blurring or masking module.
FIG. 7 is a flow chart of the system and method that describes how the
localization of
the subject takes place.
FIG. 8 is a flowchart that shows how we create a ground truth frame.
FIG. 9 is a flowchart, continuation of FIG. 8.
FIG. 10 describes a different embodiment of the invention.
FIG. 11 shows a diagram of a computing system.
FIG. 12 is a diagram describing the components of an input audio-video master file.
FIG. 13 shows a flowchart in accordance with one or more embodiments of our invention.
FIG. 14 shows a flowchart describing how the audio redacting is performed once the identified audio track is selected.
FIG. 15 shows a flowchart describing the system and method in accordance with one or more embodiments of the invention.
FIG. 16 shows a flowchart describing a different embodiment of the invention.
FIG. 17 shows a flowchart describing a different embodiment of the invention.
DETAILED DESCRIPTION
Specific embodiments of the technology will now be described in detail with
reference to
the accompanying FIGS. In the following detailed description of embodiments of
the
technology, numerous specific details are set forth in order to provide a more
thorough
understanding of the technology. However, it will be apparent to one of
ordinary skill in
the art that the technology may be practiced without these specific details.
In other
instances, well-known features have not been described in detail to avoid
unnecessarily
complicating the description.
In the following description of FIGS., any component described with regard to
a figure,
in various embodiments of the technology, may be equivalent to one or more
like-named components described with regard to any other figure. For brevity,
descriptions of these components will not be repeated with regard to each
figure. Thus,
each and every embodiment of the components of each FIG. is incorporated by
reference and assumed to be optionally present within every other FIG. having
one or
more like-named components. Additionally, in accordance with various
embodiments of
the technology, any description of the components of a FIG. is to be
interpreted as an
optional embodiment, which may be implemented in addition to, in conjunction
with, or
in place of the embodiments described with regard to a corresponding like-
named
component in any other figure.
FIG. 1 shows a typical structure of a CNN model for human face detection. Our
system
relies on Computer Vision (a discipline in Artificial Intelligence) to detect
and localize the
subject of interest in a video and differentiate this person from other people
in the same
scene. We take advantage of Machine Learning techniques to increase the
accuracy of
our system, specifically a subset of techniques commonly known as deep
learning.
Deep learning refers to the use of a Convolutional Neural Network (CNN) (100),
which
is a multi-layered system able to learn to recognize visual patterns after
being trained
with millions of examples (i.e. what a person should look like in a picture). Once trained,
a CNN model can understand images it has never seen before by relying on what
it
learned in the past. This image-understanding method can also be applied to
video
footage, as video can be understood as a sequence of images (known as
"frames") and
individual video frames can be analyzed separately.
A CNN (100) is trained using an "image dataset" (110) which comprises a vast
collection of image samples, manually annotated to indicate what the CNN (100)
should
learn. As an example, a CNN (100) can be trained to detect faces using a
dataset of
thousands of images (110) where someone has manually indicated the position
and
size of every face contained on every image. This process of annotating the
training
information in a dataset is known as "labeling a dataset". The bigger and more
diverse a
dataset (110) is, the more the CNN (100) can learn from it whilst increasing
its
accuracy.
For illustrative purposes, FIG. 1 shows a common CNN (100) architecture for
face
recognition. The CNN (100) can be seen as a set of interconnected "layers"
(101, 102,
103, 104) each containing "neurons" (120) which are implemented as units that
produce
a numerical output from a set of numerical inputs through some mathematical
formula.
Neurons (120) are considerably interconnected (mimicking the human brain) so
their
outputs can serve as input for others. The types of layers, neurons, and mathematical formulas that govern their connections are diverse, and it requires domain knowledge to
determine the optimal structure and mathematical formulas a CNN model (100)
should
have to serve a specific purpose.
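For illustration only, the following non-limiting sketch shows a small CNN of the kind described above, written in Python with the PyTorch library (an assumption; the invention does not mandate any particular framework). The layer sizes and the two-class face/non-face output are illustrative and do not describe the actual trained models.

# Illustrative sketch only: a small CNN that scores a fixed-size image patch
# as "face" vs. "non-face". PyTorch is assumed; layer sizes are arbitrary.
import torch
import torch.nn as nn

class TinyFaceCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # patterns of local contrast
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # low-level face features
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # higher-level face parts
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),
            nn.ReLU(),
            nn.Linear(128, 2),   # two outputs: face / non-face
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Score a batch of 64x64 RGB patches; random data stands in for a labeled dataset.
model = TinyFaceCNN()
patches = torch.randn(4, 3, 64, 64)
probabilities = model(patches).softmax(dim=1)   # shape (4, 2)
print(probabilities.shape)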
Further describing the inner workings of a CNN, the illustrative example in FIG. 1 shows how an input image from a dataset (110) can be provided to the CNN's input layer (101), which produces patterns of local contrast (111). That information is then fed to the first hidden layer (102), which produces an output with the face features (112); that information is then fed to the second hidden layer (103), which after processing produces a recognition of faces (113); the result is then fed to the output layer (104), producing a result of possible matches. This structure is one of many that can be used to model a CNN to produce a prediction given an input image.
As one would expect, a CNN (100) will only learn what it was taught; for
example, a
CNN trained to detect human faces cannot detect full bodies. It will also
encounter
difficulties when the environmental conditions change (i.e. natural &
artificial lighting
conditions, weather changes, image noise and quality, video camera recording
format)
unless it has been trained to handle such conditions with a sufficiently vast
and diverse
dataset.
In practice, a single task might require not a single CNN but a group of them;
for
example, a smaller CNN might be used to improve the accuracy of the output
from a
bigger CNN, or the final result might be obtained from averaging individual
results of a
set of different CNNs (often called an "ensemble"). The best architecture for
a specific
purpose is usually obtained after experimentation.
FIGS. 2A and 2B show how footage from differently placed cameras is handled. As
described in
FIG. 1, a CNN will only learn what it was taught; a CNN trained to recognize
footage of
people filmed from their front will struggle to recognize footage of people
recorded from
a ceiling. The system of our invention works when the camera is installed
facing people
in front as shown in FIG. 2A, and when the camera is installed in a high place
to film
people from above as shown on FIG. 2B. We describe both cases in the following
sections.
FIG. 2A is a side view of a person and a frontal camera. It shows a person
(200) facing
the front of a camera (201). Following the concepts discussed in FIG. 1, in
the interest
of optimizing performance and accuracy, our system includes two modules that
act
depending on the camera's position and distance from the people being filmed.
Note
that the methods described for both cases can be used simultaneously in the
same
video, since certain video footage might contain people from both cases. An
example of
a frontal camera is a body camera.
FIG. 2B is a side view of pedestrians with an overhead security camera. It
shows a side
view when the camera (202) is installed in a high place to film people (203,
204) from
above. One familiar with the art will appreciate that the angle from where the
camera is
filming or capturing the image of the person or pedestrian (203, 204) is
different from
what is shown in FIG. 2A. As such, it needs a different method to process the
subject
recognition described in this document.
FIGS. 3A to 3C show illustrative steps for one embodiment of the invention. It
exemplifies frontal camera footage. FIG. 3A shows the original image with 4
persons
(301, 302, 303, 304). FIG. 3B shows how faces (311, 312, 313, 314) are
detected and
extracted from the original image for further analysis. FIG. 3C shows how all
faces are
transformed to a mathematical representation for comparison purposes (feature
vectors).
Continuing with the description of FIGS. 3A to 3C, in the same example using
the
frontal camera footage, when the camera is installed to record people from the
front so
that their face is exposed and recognizable as in FIG. 2A, the system of our
invention
relies on two modules; the first, a face-detection module including but not
limited to a
CNN architecture that detects and localizes faces in every video frame,
obtaining a
bounding box for every face so that we can extract them as illustrated in FIG. 3B; and second, a face recognition module including but not limited to a CNN
architecture for
face recognition that is able to determine that a certain face belongs to a
certain person.
This process needs to transform every face found into a mathematical
representation
that we can compare against the representations of the subject of interest, as
illustrated
in FIG. 3C.
The face-detection module is implemented by reusing an existing network
architecture
meant for object detection. Some examples of existing compatible CNN
architectures
are: faced, YOLOv3, and MTCNN, among others. These architectures can be re-
trained
to detect faces in the scene environment conditions and video format required
by the
system, or alternatively, their structure can be used as inspiration to
formulate a
customized face-detection CNN model for this use-case.
This "make or reuse" decision is often made based on a small study where we
test
multiple architectures against actual video footage provided by the user to
obtain a
quantitative benchmark that reveals the option with the highest accuracy and
performance for the technical particularities of the user's camera systems.
On the other hand, the process of transforming an image into its
representative
mathematical form as illustrated in FIGS. 3B and 3C is a process known as
"feature
extraction". Given an image, this process obtains a set of numbers (known as
"feature
vector") (321, 322, 323, 324) that represent the image's (311, 312, 313, 314)
visual
features. Feature vectors can be compared against each other using
mathematical
formulas (i.e. by their n-dimensional euclidean distance), allowing the system
to
determine the probability of two faces belonging to the same person. This step
in our
system relies on CNN models for facial feature extraction like FaceNet and
OpenFace,
to name a few published by the research community.
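The following is a minimal, non-limiting sketch of this comparison step in Python with NumPy, assuming the feature vectors have already been produced by a facial feature extractor such as FaceNet; the mapping from Euclidean distance to a 0.0-1.0 score and the scale value are assumptions made for the sketch, not the system's actual formula.

# Sketch only: compare face feature vectors by Euclidean distance.
# Random vectors stand in for real embeddings produced by a feature extractor.
import numpy as np

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def match_probability(distance, scale=10.0):
    # Illustrative mapping: identical vectors give 1.0, distant vectors approach 0.0.
    return float(np.exp(-distance / scale))

rng = np.random.default_rng(0)
subject_vector = rng.normal(size=128)
candidate_vectors = [rng.normal(size=128) for _ in range(4)]

for i, vector in enumerate(candidate_vectors):
    d = euclidean_distance(subject_vector, vector)
    print(f"face {i}: distance={d:.2f}, probability={match_probability(d):.2f}")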
FIG. 4A shows a flowchart in accordance with one or more embodiments of the
invention. While the various steps in these flowcharts are presented and
described
sequentially, one of ordinary skill will appreciate that some or all of the
steps may be
executed in different orders, may be combined or omitted, and some or all of
the steps
may be executed in parallel. Furthermore, the steps may be performed actively
or
passively. The flowchart shows how one embodiment of our invention works. It
shows
the steps for our face recognition method, as described in FIGS. 1 through 3,
including
our method for blurring all faces that don't belong to a person of interest
(from now on
called "the subject" of "subject of interest") for the same example described
in FIGS. 3A
to 3C.
Steps 401 and 402 show the inputs to our method.
Step 401 - Video file: The user provides the system a video file where the
person of interest appears.
Step 402 - Pictures of the subject: The user loads one or multiple front-face
images of the subject of interest into our system. If this is not possible,
our
system will alternatively present an interface that facilitates the gathering
of these
images from the video itself.
After preparing the input, the following is a description of the method's
steps:
Step 411 - Extract subject's features: Through our face recognition module,
the system will obtain the feature vectors of the pictures of the subject and
preserve them for later.
Step 412 - Detect faces in all frames: For every frame in the video, the
system
will use our face detection module to detect and localize all faces, obtaining
a
"bounding box" for every face found.
Step 413 - Extract features from all faces: For every face found, the system
will use our face recognition module to obtain their corresponding feature
vectors.
Step 414 - Compare all faces: The system will use our face recognition module to compare all the feature vectors obtained in Step 413 with the feature vectors obtained for our subject of interest in Step 411. For all faces in the video, this process will output a probability of the face belonging to the subject of interest (measured as a number from 0.0 to 1.0).
Step 415 - Label the subject: All faces with a high probability of belonging
to the
subject of interest are labeled as positive cases, while every other face is
labeled
as a negative case. One familiar with the art will appreciate that a face
could also
be reflected in mirrors or shiny objects such as chromed items for example. In
a
different embodiment of the invention, the system allows the user or operator
to
select objects of interest, for example faces and then for the system to
redact or
mask these objects from the scene. One familiar with the art will appreciate that a mask can also be described as a black-out. In a different embodiment of the invention, when redacting complicated scenes that may contain a large number of objects with personally identifiable information, for example multiple people and objects, the operator of the system can select an area to be blurred instead of just blurring individual objects of interest.
Step 416 - Blur faces: Using our blur module, all faces labeled as negative
cases are blurred. The blurring process is also known as redacting the objects from
the scene, wherein the objects include faces, text, and other identifiable
objects.
Step 417 - The output produced by our method is a video where all faces
different from the subject of interest will appear blurred.
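For illustration, the decision logic of Steps 414 through 416 can be sketched as follows in Python, assuming face detection and feature extraction (Steps 412 and 413) have already produced, for a given frame, a list of bounding boxes with their feature vectors; the probability function and the 0.8 threshold are assumptions made for the sketch only.

# Sketch of Steps 414-416: compare each detected face with the subject's
# reference vectors, label it, and decide whether it must be blurred.
import numpy as np

THRESHOLD = 0.8   # assumed probability threshold for a positive match

def match_probability(distance, scale=10.0):
    return float(np.exp(-distance / scale))

def label_frame(faces, subject_vectors):
    """faces: list of (bounding_box, feature_vector). Returns (box, blur?) pairs."""
    decisions = []
    for box, vector in faces:
        # Step 414: best probability against any reference vector of the subject.
        probability = max(match_probability(np.linalg.norm(vector - ref))
                          for ref in subject_vectors)
        is_subject = probability >= THRESHOLD     # Step 415: positive / negative label
        decisions.append((box, not is_subject))   # Step 416: blur the negatives
    return decisions

# Toy data standing in for real detections.
rng = np.random.default_rng(1)
subject_vectors = [rng.normal(size=128)]
frame_faces = [((10, 10, 50, 50), subject_vectors[0] + 0.1 * rng.normal(size=128)),
               ((80, 20, 40, 40), rng.normal(size=128))]
for box, blur in label_frame(frame_faces, subject_vectors):
    print(box, "blur" if blur else "keep")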
In a different embodiment of the invention, once the system and method
successfully
recognizes the subject, it uses the same feature vectors in other videos where
the
subject of interest is present, which makes possible the recognition of the
subject in
multiple videos using the same process. In a different embodiment of our
invention,
after running all the steps described above, if the pictures provided for the
subject don't
suffice for its successful recognition, the user can try again or start a new
process from
Step 402 by loading a different set of pictures or by gathering the subject's
pictures from
a video where the face appears clearly.
FIG. 4B shows a frontal picture and how each face is marked with a box.
FIG. 5A shows a second embodiment of the invention using overhead camera
footage.
It shows when the camera is installed in a ceiling or some kind of high place
to film,
capture, or record the images of people from above. In the example shown in
FIG. 5A,
the system is not expected to detect the faces of people on the ground, as they most often will not be recognizable by traditional automated computer vision methods due
to lower
video quality (required for reasonable costs on storage) and the camera angle
not
allowing full face exposure. Instead, we track the entire person's body (501,
502, 503,
504, 505, 506, 507) using two modules, the first being a pedestrian detection
module
comprising a CNN architecture that detects and localizes pedestrians (501,
502, 503,
504, 505, 506, 507) in every video frame, obtaining a bounding box for every
pedestrian
so that we can extract them as illustrated in FIG. 5B where the second module
is a
pedestrian recognition module comprising a CNN architecture for pedestrian
recognition
that is able to determine the probability that one of the detected pedestrians
(511, 512,
513, 514, 515, 516, 517) looks the same as the subject of interest. This
process needs
to transform every pedestrian found into a mathematical representation that we
can
compare against the representations of the subject of interest, as illustrated
in FIG. 5C.
FIG. 5B shows a different embodiment of our invention, the pedestrian
detection module
reuses an existing network architecture meant for object detection. Some
generic
examples of compatible network architectures are MobileNets, Inception-v4,
Faster
R-CNN, to name a few. These architectures can be re-trained to detect
pedestrians
(501, 502, 503, 504, 505, 506, 507) in the scene environment conditions and
video
format required by the user, or alternatively, their structure can be used as
inspiration to
formulate a customized pedestrian detection CNN model for this use-case. This
"make
or reuse" decision is made based on a small study where we test multiple
alternatives
against actual video footage provided by the user to obtain a quantitative
benchmark
that reveals the option with the highest accuracy and performance for the
technical
particularities of the user's camera systems. The output is known as the
pedestrian
detection's probability.
Note that in cases where the pedestrian detection's probability isn't
confident enough to
make a decision on marking a pedestrian (i.e. the pedestrian doesn't look as
expected
or is covered by other people, etc.), further information like the position
and speed of
pedestrians in previous and future video frames might be used to refine and
support the
decision.
In a different embodiment of our invention, the process of transforming an
image (511,
512, 513, 514, 515, 516, 517) into its representative mathematical form (521,
522, 523,
524, 525, 526, 527) as illustrated in FIG. 5C is the same "feature extraction"
process
explained for the frontal-camera case (FIGS. 3A to 3C). We again rely on the feature descriptor generation
capabilities of
existing CNN generic object detection models to obtain the feature vector of
every
pedestrian.
FIG. 6A shows a flowchart describing the method for blurring or masking all
pedestrians
or people that don't belong to a person of interest (from now on called "the
subject") in
accordance with one or more embodiments of the invention.
The inputs to our method are:
Step 620 - Video file: The user provides the system a video file where the
person
of interest appears.
Step 621 - Subject selection- After all pedestrians in the video have been
detected, the system will ask the user to select (i.e. click on) the person of
interest. This operation might be required more than once in order to improve
detection results.
The method steps can be described as follows:
Step 630 - Detect pedestrians in all frames: Using our pedestrian detection
module we obtain the bounding box for all pedestrians present for every frame
in
the video.
Step 631 - Extract features from all pedestrians: For every pedestrian found,
the
system will use our pedestrian recognition module to obtain their
corresponding
feature vectors.
Step 633 - Extract subject's features: The user will manually select the
detected
pedestrian that matches the subject of interest. Further frames of the same
pedestrian detected with a high degree of confidence might be used to increase
the amount of information on this pedestrian.
Step 634 - Compare all pedestrians: The system will use our pedestrian
recognition module to compare all the feature vectors obtained in Step 631 with the feature vectors obtained for our subject of interest in Step 633. For all
pedestrians in
the video, this process will output a probability of the pedestrian being the
subject
of interest (measured as a number from 0.0 to 1.0).
Step 635 - Label the subject: All pedestrians with a high probability of being
the
subject of interest are labeled as positive cases, while every other
pedestrian is
labeled as a negative case.
Step 636 - Blur faces: Using our blur module, blur all faces belonging to
pedestrians labeled as negative cases.
FIG. 6B shows the blurring or masking module. In all cases of detected faces
or
pedestrians or subjects of interest, we use a separate face blur module that
is in charge
of masking or blurring the person's head (601).
One embodiment of our invention uses the following blurring methods from an
original
image (601) where the subject's face (611) is not covered or blurred: In a
first
embodiment of the invention it uses the covering method (602): In this case, we
cover
the head (611) with a solid color polygon (602) (i.e. a black ellipse) to
completely hide it.
In a different embodiment of the invention the Blurring method (603) is used.
In this
case the system applies a "pixelation"(613) filter to a polygon (i.e. an
ellipse) that
surrounds the head. The pixelation filter will divide the head area into a
"grid", and for
every cell in the grid, replace it with a square with the average color of all
pixels
contained in said cell.
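A minimal sketch of the pixelation filter described above, using NumPy only; the rectangular region stands in for the polygon surrounding the head, and the cell size corresponds to the user-configurable grid size.

# Sketch of the pixelation filter: divide the region into a grid and replace
# every cell with the average colour of the pixels it contains.
import numpy as np

def pixelate(image, box, cell=8):
    """Pixelate image[y0:y1, x0:x1] in place using cell x cell blocks."""
    x0, y0, x1, y1 = box
    region = image[y0:y1, x0:x1]
    for y in range(0, region.shape[0], cell):
        for x in range(0, region.shape[1], cell):
            block = region[y:y + cell, x:x + cell]
            average = block.reshape(-1, block.shape[-1]).mean(axis=0)
            block[...] = average.astype(image.dtype)
    return image

frame = np.random.randint(0, 256, size=(120, 160, 3), dtype=np.uint8)
pixelate(frame, box=(40, 20, 100, 80), cell=10)   # pixelate a 60x60 head area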
In a different embodiment of the invention - to protect against algorithms
that try to
recover the original image using different methods including but not limited
to reverse
engineering a blurred image - the system randomly switches pixel places to
make
recovery impossible. In a different embodiment of the invention, the polygon,
size of
pixels, and amount of pixels switched are user-configurable. One familiar with
the art
will appreciate that this method can also be used to blur ID tags as well
(when text is
detected), by switching the blur area polygon to a rectangle. These features
are user
configurable too.
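A minimal sketch of the pixel-switching step, again with NumPy; the fraction of pixels switched corresponds to the user-configurable amount mentioned above, and the box is assumed to be the same redacted head or text region.

# Sketch of the pixel-switching step: randomly permute a fraction of the
# pixels inside the redacted region so the original cannot be reconstructed.
import numpy as np

def shuffle_pixels(image, box, fraction=0.5, rng=None):
    """Randomly swap `fraction` of the pixels inside box=(x0, y0, x1, y1)."""
    rng = rng or np.random.default_rng()
    x0, y0, x1, y1 = box
    region = image[y0:y1, x0:x1].copy()
    flat = region.reshape(-1, region.shape[-1])
    idx = rng.choice(len(flat), size=int(len(flat) * fraction), replace=False)
    flat[idx] = flat[rng.permutation(idx)]   # swap the chosen pixels among themselves
    image[y0:y1, x0:x1] = region             # write the scrambled region back
    return image

frame = np.random.randint(0, 256, size=(120, 160, 3), dtype=np.uint8)
shuffle_pixels(frame, box=(40, 20, 100, 80), fraction=0.7)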
When the face of the person has already been detected and localized (i.e.
after face
detection), the system applies the blur method to the area of the face.
However, when
dealing with the whole human body (as in the pedestrian detection module
described
above), the head must first be accurately localized, and its size and position
will vary
depending on the camera angle and the subject's pose (i.e. the geometrical
coordinates
of the head of a person standing and a person sitting will appear different).
To localize the head on detected pedestrians or subjects of interest, this
module
includes a head detection method based on a CNN architecture for human pose
estimation i.e. OpenPose among others. The pose estimation CNN outputs an
approximation of the head's position and size, which the system then proceeds
to
obfuscate.
In a different embodiment of the invention, in regard to text detection, the
system of our
invention uses reliable scene text detection methods that are available and implemented for direct application. In a different embodiment of the invention, the system
uses methods
that achieve higher degrees of accuracy and utilize deep learning models; for
example,
the EAST, Pyrboxes, and Textboxes++ methods to name a few published by the
research community. In a different embodiment of the invention, after the
detection and
localization of the text in the video footage has occurred, the system then
blurs any text
that is positioned inside the bounding box of any detected pedestrian or below
a nearby
detected face. Examples of text that one may want to blur are name tags, licence plates, and any other text that needs to be protected for privacy concerns.
FIG. 7 shows a flowchart describing how the localization of the subject is
performed in
accordance with one or more embodiments of the invention.
Step 701 localizing a subject of interest in a geographical area. One familiar
with the art will appreciate that this could take the form of a coordinate, a
region,
a building, room, or similar area such as a street or plaza. One familiar with
the
art will also appreciate that a subject of interest might be a person, or part
of a
person, for example the face of a person. The subject of interest could also
be an
animal, a tag, or a licence plate, to name a few.
Step 702 Does the subject of interest carry a geolocation sensor? Wherein
the geolocation sensor is connected to a remote server where the device
running
our software can retrieve this data.
If yes, then Step 703; if no, then Step 704.
Step 703, retrieve the geopositioning data from a smart gadget. Wherein
localizing a person of interest comprises using the geopositioning data from a
smart gadget's sensor that the person of interest carries. For example, smart
gadgets such as smartphones have geopositioning sensors, where the data
collected from those sensors can be transmitted to a remote server or directly
to
the device running the software of our invention, and, by using software
applications, one with the appropriate permissions could access the
geopositional data generated by the smart gadget of the subject of interest.
With
such data, one could identify the location of the subject of interest using
coordinates sent by the smart gadget. In a different example, when the subject
of
interest is an internet of things device, such as an autonomous vehicle, a
shared
bicycle, or a drone (to name a few), those devices by themselves already have
geopositioning sensors and data transmission capabilities that transmit their
localization. Then Step 705.
Step 704, detecting the presence of the person of interest in an area,
manually, by automatic face detection, or by other means of detection such as
a
request from the person of interest himself by providing the date and time and
location of the person of interest. For example, people in Canada have a legislated right to request access to government information under the Access to Information Act and to their own personal information under the Privacy Act. Then step 705.
FIG. 10 describes the process for the face detection.
Step 705 matching the localization of the subject of interest with a first
video, wherein a first video comprises a video from a recording or live video
feed
from a camera, for example, a security camera or a smart gadget. Wherein the frames in the first video comprise all of the frames in the first video or selected frames in the first video.
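For illustration, a non-limiting Python sketch of matching a retrieved geolocation fix to candidate cameras; the camera list, the 50-metre radius, and the field names are hypothetical and would depend on the deployment.

# Sketch only: select cameras whose coverage area and recording window contain
# the subject's geolocation fix (latitude, longitude, timestamp).
import math
from datetime import datetime, timedelta

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cameras_covering(fix, cameras, radius_m=50.0):
    lat, lon, t = fix
    return [cam["id"] for cam in cameras
            if haversine_m(lat, lon, cam["lat"], cam["lon"]) <= radius_m
            and cam["start"] <= t <= cam["end"]]

noon = datetime(2019, 10, 8, 12, 0, 0)
cameras = [
    {"id": "lobby-1", "lat": 45.4215, "lon": -75.6972,
     "start": noon - timedelta(hours=2), "end": noon + timedelta(hours=2)},
    {"id": "garage-3", "lat": 45.4300, "lon": -75.6900,
     "start": noon - timedelta(hours=2), "end": noon + timedelta(hours=2)},
]
print(cameras_covering((45.4216, -75.6971, noon), cameras))   # ['lobby-1']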
In a different embodiment of the invention the localization of the subject is
made
manually either from live video feed monitoring or from pre-recorded tapes of
different
cameras and areas. For example, video footage that is reviewed by employees to
identify scenes containing certain individuals. Such video footage may come
from a
single camera or multiple cameras in a single location or multiple locations.
In this
example, the first video mentioned in Step 704 of FIG. 7 will be considered a
manually
identified video or videos containing the image of the subject of interest.
FIG. 10 shows
how a face detection can replace the manual entry.
FIG. 8 shows a flowchart describing how the invention creates a ground truth
frame
when the images come from a camera installed facing people from above in
accordance
with one or more embodiments of the invention. Wherein the subjects are
pedestrians.
FIG. 8 is a continuation of FIG. 7.
Step 801 Video selection: Loading a video where the subject of interest is
present, as described in FIG. 7.
Step 802 selecting a first still frame from the first video, usually one from
the first group of frames where the subject of interest appears in the video.
For
example, a video of a thief, taken by a security camera, where in the video,
other
subjects are present, for example a girl and a policeman. All three subjects are identified, but only one individual is the subject of interest whose face we want to show; for the rest of the identified subjects (the girl and the policeman), their faces need to be masked.
Step 803 detecting and localizing individual subjects in the first still frame
by using pre-trained convolutional neural networks,
Step 804 obtaining a bounding box for all individual subjects present in
frames of the first video, wherein the frames include the first still frame,
Step 805 creating a first ground truth frame by selecting and marking a
detected individual subject as the marked subject at the first still frame.
Wherein
the selection of the detected individual as the person of interest comprises one or more from the group of marking the subject by a user and automatically detecting the individual by matching the individual's biometrics. One familiar with the art
will
appreciate that biometrics is the technical term for body measurements and
calculations. Biometrics refers to metrics related to human characteristics.
Biometrics authentication (or realistic authentication) is used in computer
science
as a form of identification and access control. It is also used to identify
individuals
in groups that are subjects of interest or under surveillance. Biometric
identifiers
are the distinctive, measurable characteristics used to label and describe
individuals. Biometric identifiers are often categorized as physiological
versus
behavioral characteristics. Physiological characteristics are related to the
shape
of the body.
FIG. 9 shows a flowchart describing the method to identify a person and
masking their
faces in accordance with one or more embodiments of the invention. FIG. 9 is a
continuation of FIG. 8.
Step 901 Obtaining a set of features for the marked subject by extracting the
visual features of the marked subject from any of the subsequent frames and
any
contiguous frame where the marked subject is determined to be present.
Wherein the visual features of the marked individual subject comprise one or more from the group of feature vectors.
Step 902 Obtaining feature vectors for all individual subjects detected: obtaining, for every other frame where the marked subject is determined to be present, feature vectors for all individual subjects detected. Then the next step.
Step 903 compute the vector distance between the feature vectors obtained
and the feature vectors stored for the marked subject,
Step 904 determining if any of the individual subjects matches the marked
subject. Also determining the position and displacement speed of the
individual
subjects to discard unlikely matches.
Step 905 masking every detected individual subject that does not match
the marked subject. Wherein masking comprises one or more from the group of
blurring, changing individual pixels to a different color than the original
pixel color.
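A minimal sketch of Steps 903 and 904 in Python with NumPy: a candidate is accepted only if its feature-vector distance to the marked subject is small and the displacement implied since the last confirmed position is plausible; both thresholds are illustrative assumptions.

# Sketch of Steps 903-904: feature distance plus a displacement-speed check
# used to discard unlikely matches.
import numpy as np

MAX_DISTANCE = 12.0              # assumed feature-distance threshold
MAX_SPEED_PX_PER_FRAME = 40.0    # assumed upper bound on movement between frames

def plausible_match(candidate_vector, candidate_center, frame_index,
                    subject_vector, last_center, last_frame_index):
    if np.linalg.norm(candidate_vector - subject_vector) > MAX_DISTANCE:
        return False                               # Step 903: too far in feature space
    frames_elapsed = max(frame_index - last_frame_index, 1)
    speed = np.hypot(candidate_center[0] - last_center[0],
                     candidate_center[1] - last_center[1]) / frames_elapsed
    return speed <= MAX_SPEED_PX_PER_FRAME         # Step 904: discard unlikely jumps

rng = np.random.default_rng(2)
subject_vector = rng.normal(size=128)
print(plausible_match(subject_vector + 0.05 * rng.normal(size=128), (310, 205), 31,
                      subject_vector, (300, 200), 30))   # True: close in looks and position
print(plausible_match(subject_vector + 0.05 * rng.normal(size=128), (900, 800), 31,
                      subject_vector, (300, 200), 30))   # False: implausible jump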
FIG. 10 shows a flowchart describing a different embodiment of the invention when the camera captures the images facing people or subjects from the front, where their faces are exposed and recognizable, in accordance with one or more embodiments of the invention.
Step 1001 Load a video recording where the person of interest appears. For
example, a person walks into an office that has a security camera system that
includes personal (face) cameras or overhead (pedestrian) cameras.
Step 1002 Face detection. Detect and locate faces in the video, using a
pre-trained deep learning model for face detection and localization. This
model
can be obtained by training (or re-training) a Convolutional Neural Network
(CNN) with an architecture that supports face detection configured to detect
and
localize a single class (human face). Examples of network architectures for
this
step: faced, YOLOv3, MTCNN, amongst other public alternatives. In a different
embodiment of the invention, as each technique performs differently depending
on the video quality, scene complexity, distance to the subject, among others,
the
system uses one or more face detection techniques.
As output, this step infers the position and size of the "bounding box" for
all faces
in the frame.
Step 1003 Assemble the set of faces detected into this frame's "face
group". Having obtained a bounding box for all faces in the frame, each
bounding box's image contents is copied and stored to process separately from
the rest of the frame. This method addresses the set of cropped faces obtained
this way as the "face group" for the frame.
Step 1004 Group feature extraction: For each face in the "face group", encode
the face into a "feature vector" as explained in Step 1002.
Step 1005 - Feature matching. Compare the feature vectors of all faces in this
frame's "face group" with the feature vectors available for the person of
interest
(a process often called "distance measurement").
Step 1006 - Does the "face group" contain a face where the distance
measured is low enough to be a match? If yes, then Step 1007; if not, then Step 1002. If
the "face group" contains a face where the distance measured is low enough
(within thresholds configurable by the user), it is "marked" as the person of
interest.
Step 1007 - Masking every "face group" not marked as the person of
interest. Wherein masking comprises one or more from the group of blurring,
changing individual pixels to a different color than the original pixel color.
In a
different embodiment of the invention, in every video frame where the subject
is
successfully recognized, a record is produced on the coordinates of the
subject's
face. If due to motion blur or noise the subject's face cannot be recognized,
the
position of the face between frames is predicted based on the face's
coordinates
in previous and future frames.
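A minimal sketch of this position-prediction step in Python: when recognition fails in a frame, the face's bounding box is interpolated linearly between the nearest previous and future frames where it was found. Linear interpolation and the dictionary of known positions are assumptions standing in for the record the system produces.

# Sketch only: predict a missing bounding box from surrounding frames.
def interpolate_box(known, frame):
    """known: {frame_index: (x, y, w, h)}; returns a predicted box for `frame`."""
    if frame in known:
        return known[frame]
    before = max((f for f in known if f < frame), default=None)
    after = min((f for f in known if f > frame), default=None)
    if before is None or after is None:
        return None                         # cannot predict outside the known range
    t = (frame - before) / (after - before)
    b0, b1 = known[before], known[after]
    return tuple(round(a + t * (b - a)) for a, b in zip(b0, b1))

known_positions = {10: (100, 80, 40, 40), 14: (140, 80, 40, 40)}
print(interpolate_box(known_positions, 12))   # (120, 80, 40, 40)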
Embodiments of the invention may be implemented on a computing system. Any
combination of mobile, desktop, server, router, switch, embedded device, or
other types
of hardware may be used. For example, as shown in FIG. 11, the computing
system
(1100) may include one or more computer processors (1101), non-persistent
storage
(1102) (for example, volatile memory, such as random access memory (RAM),
cache
memory), persistent storage (1103) (for example, a hard disk, an optical drive
such as a
compact disk (CD) drive or digital versatile disk (DVD) drive, a flash memory,
etc.), a
communication interface (1104) (for example, Bluetooth interface, infrared
interface,
network interface, optical interface, etc.), and numerous other elements and
functionalities.
The computer processor(s) (1101) may be an integrated circuit for processing
instructions. For example, the computer processor(s) may be one or more cores
or
micro-cores of a processor. The computing system (1100) may also include one
or
more input devices (1110), such as a touchscreen, keyboard, mouse, microphone,
touchpad, electronic pen, or any other type of input device.
The communication interface (1104) may include an integrated circuit for
connecting the
computing system (1100) to a network (not shown) (for example, a local area
network
(LAN), a wide area network (WAN) such as the Internet, mobile network, or any
other
type of network) and/or to another device, such as another computing device.
Further, the computing system (1100) may include one or more output devices
(1106),
such as a screen (for example, an LCD display, a plasma display, touch screen,
cathode ray tube (CRT) monitor, projector, or other display device), a
printer, external
storage, or any other output device. One or more of the output devices may be
the
same or different from the input device(s). The input and output device(s) may
be
locally or remotely connected to the computer processor(s) (1101), non-persistent storage (1102), and persistent storage (1103). Many different types of
computing
systems exist, and the aforementioned input and output device(s) may take
other forms.
Software instructions in the form of computer readable program code to perform
embodiments of the invention may be stored, in whole or in part, temporarily
or
permanently, on a non-transitory computer readable medium such as a CD, DVD,
storage device, a diskette, a tape, flash memory, physical memory, or any
other
computer readable storage medium.
Specifically, the software instructions may
correspond to computer readable program code that, when executed by a
processor(s),
is configured to perform one or more embodiments of the invention.
The computing system (1100) in FIG. 11 may be connected to or be a part of a
network.
For example, a network may include multiple nodes (for example, node X, node Y).
Each node may correspond to a computing system, or a group of nodes combined
may
correspond to the computing system shown in FIG. 11. By way of an example,
embodiments of the invention may be implemented on a node of a distributed
system
that is connected to other nodes. By way of another example, embodiments of
the
invention may be implemented on a distributed computing system having multiple
nodes, where each portion of the invention may be located on a different node
within
the distributed computing system. Further, one or more elements of the
aforementioned
computing system (1100) may be located at a remote location and connected to
the
other elements over a network.
FIG. 12 is a diagram describing the components of an input audio-video master
file
(1200) comprising one or more tracks (1201, 1202, 1203); tracks can be audio
(1202),
video (1201), or metadata files (1203), to name a few. Each video file is made
of
frames. Each frame has a timestamp which corresponds to a linear timeline. A
track can
be edited or cut. When edited, it can be split into other tracks or combined
with other
tracks. For example, a single audio track may contain a conversation between two people; if the audio for each person is identified, that single track can be split into two or more tracks, each track containing one individual's voice. In a reverse scenario, a
couple of
tracks can be edited to become one track. When redacting an audio file, the
content of a
portion of the audio file can be cut or replaced with a different audio
source. For
example, such audio source may take the form of silence, a beep, a continuous
tone,
or a different voice. The Metadata track may also contain other information
such as
source of the video, date and time, and other data.
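For illustration, a non-limiting Python sketch of splitting one audio track into two, using the standard-library wave module and NumPy; a stereo file is split into two mono files, 16-bit PCM samples are assumed, and the file names are hypothetical. Splitting a conversation by speaker would reuse the same write-out step once per-speaker samples have been isolated.

# Sketch only: split a stereo (two-channel) track into two mono tracks.
# Assumes 16-bit PCM WAV input; file names are hypothetical.
import wave
import numpy as np

def split_stereo(path_in, path_left, path_right):
    with wave.open(path_in, "rb") as wf:
        params = wf.getparams()
        samples = np.frombuffer(wf.readframes(params.nframes), dtype=np.int16)
    samples = samples.reshape(-1, params.nchannels)   # one column per channel
    for column, path_out in enumerate([path_left, path_right]):
        with wave.open(path_out, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(params.sampwidth)
            out.setframerate(params.framerate)
            out.writeframes(samples[:, column].tobytes())

# split_stereo("master_audio.wav", "voice_a.wav", "voice_b.wav")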
FIG. 13 shows a flowchart in accordance with one or more embodiments of our
invention. While the various steps in these flowcharts are presented and
described
sequentially, one of ordinary skill will appreciate that some or all of the
steps may be
executed in different orders, may be combined or omitted, and some or all of
the steps
may be executed in parallel. Furthermore, the steps may be performed actively
or
passively.
FIG. 13 shows a flowchart describing the system and method to automatically
replace
zones of interest with alternate audio from an audio and video recording or
feed in
accordance with one or more embodiments of the invention. Continuing with the
description of FIG. 13, the zone of interest for audio may include, for
example, any
audio that contains information that is personally identifiable and private to
anyone that
is not the interested party - an interested party may be, for example, a
requestor of
information, a judge, lawyer, police, or any other person who has an interest
in a person
of interest in a video but should keep the privacy of others intact. For
example, the video
is a recording of an interview, and during the interview a person of interest,
in this case,
an alleged criminal, answers a question by providing the name and address of
an
acquaintance. The name and address of the acquaintance would need to be
redacted
by bleeping the later portion of the audio.
Step 1301 - Master audio track input. The master audio track can be a single
track (mono) or multiple tracks, i.e. left and right audio tracks (stereo).
The
master audio track as well as the video track are part of a linear timeline
that runs
simultaneously. When used with a video file that contains audio, the audio is
stored in the audio track of the video file, thus the name audio-video file or
recording.
Step 1302 - Matching a first audio tone (for example pitch of their voice, the
speed of their talk, their language), from a master audio track, to a first
person
while the first person moves their lips. This process can be automatic or
manual.
Automatic is using computer vision to identify when the person moves the lips.
Manual is by manually identifying the person where the operator thinks or
knows
is the person who is talking. The automatic process includes using a pre-
trained
machine learning convolutional neural network architecture to identify lips on
a
face that move at the same time the audio track reproduces the sound of the
voice of a person. If a person is detected to move their lips while a voice
recording is detected in the audio track, the probability that the person moving the lips is the speaker is high enough to mark that person as the person talking. When two
or
more persons become subjects of interest due to the fact that they move their
lips when the sound of a voice is detected in the audio track, an exhaustive process of elimination by evaluation of the audio and video tracks is performed: when a person who is suspected to be the person of interest does not move their lips while the audio track contains the voice to which that person was matched, then, by a process of elimination, that person is no longer the subject of interest producing that voice. This process repeats until all of the individuals
identified in
the video are matched to a voice recording. Each voice can be identified by
their
individual owner using speaker identification which refers to identifying the
speaker, rather than what they are saying. Recognizing the speaker can
simplify
the task of translating speech in systems that have been trained on a specific
person's voice or it can be used to authenticate or verify the identity of a
speaker
as part of a security process.
Step 1303 - Isolating a first identified audio track from the rest of the
audio based
on the first audio tone. This process is then repeated for other individuals in the same audio-video recording or track. The number of concurrent individuals at the same time in a segment of the recording could represent the number of individuals speaking. At any given time, the system may or may not have a view of the individual's lips. It is when the computer vision system has access to the lips of the individual's face that the match can be made. Once an individual voice has been matched to its owner, it is not necessary for the camera to have a view of the lips of the owner of the voice to identify that individual as the one talking. This process is important, as the final redacted video may require removing the audio from a specific individual, or leaving only the audio of the specific individual and redacting the rest of the audio.
Step 1304 - Matching a second audio tone, from the master audio track, to a
second person while the second person moves their lips,
Step 1305 - Isolating a second identified audio track from the rest of the
audio
based on the second audio tone,
Step 1306 - Making a determination of the identified audio track to silence.
If the
audio track is divided into individual tracks by the voice of each individual,
one
familiar with the art will appreciate that audio in which two or more
individuals
talk at the same time can be redacted so only one individual's voice is heard
on
the final redacted audio track, or, that one single individual audio or voice
is
redacted so that all other voices are heard and only the one from that
individual
is redacted or silenced.
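For illustration, a non-limiting Python sketch of the matching described in Steps 1302 through 1305: per-frame lip-movement flags for each visible person are compared with voice-activity flags for the audio, and the voice is attributed to the person whose lip movement agrees with it most often. In the described system these inputs would come from the lip-movement CNN and a voice-activity detector; here they are toy arrays.

# Sketch of the lip-sync matching: attribute a voice to the person whose
# per-frame lip movement best agrees with the audio's voice activity.
import numpy as np

def attribute_voice(lip_activity, voice_activity):
    """lip_activity: {person_id: bool array per frame}; voice_activity: bool array."""
    scores = {person: float(np.mean(lips == voice_activity))
              for person, lips in lip_activity.items()}
    return max(scores, key=scores.get), scores

voice = np.array([1, 1, 1, 0, 0, 1, 1, 0], dtype=bool)
lips = {
    "person_a": np.array([1, 1, 1, 0, 0, 1, 0, 0], dtype=bool),   # mostly in sync
    "person_b": np.array([0, 0, 1, 1, 1, 0, 0, 1], dtype=bool),   # out of sync
}
speaker, scores = attribute_voice(lips, voice)
print(speaker, scores)   # person_a is marked as the person talking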
FIG. 14 shows a flowchart describing how the audio redacting is performed once
the
identified audio track is selected in accordance with one or more embodiments
of the
invention. FIG. 14 is a continuation of the process from FIG. 13.
Step 1401 - Creating a linear timestamp of periods of time to silence by
matching
the audio tracks to silence with the audio and video timeline. One familiar
with
the art will appreciate that when we mention silence it also represents the insertion of alternate audio, which comprises one or more from the group of blurring, a bleep, silence, white noise, and anonymizing of voices.
Step 1402 - Editing the master audio track by silencing the timestamped periods of time to silence.
Step 1403 - Output an edited audio and video recording or feed.
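For illustration, a non-limiting Python sketch of Steps 1401 and 1402: given redaction periods in seconds on the shared timeline, the corresponding sample ranges of the master audio track are silenced or replaced with a bleep tone. The sample rate, the 16-bit sample format, and the tone are assumptions made for the sketch.

# Sketch only: silence or bleep the timestamped periods of a master audio track
# held as a NumPy array of 16-bit samples.
import numpy as np

def redact_periods(track, sample_rate, periods, mode="silence", tone_hz=1000.0):
    edited = track.copy()
    for start_s, end_s in periods:
        a, b = int(start_s * sample_rate), int(end_s * sample_rate)
        if mode == "silence":
            edited[a:b] = 0
        else:   # replace the period with a bleep tone
            t = np.arange(b - a) / sample_rate
            edited[a:b] = (0.3 * np.iinfo(track.dtype).max *
                           np.sin(2 * np.pi * tone_hz * t)).astype(track.dtype)
    return edited

sample_rate = 16000
master = (0.1 * np.iinfo(np.int16).max *
          np.sin(2 * np.pi * 220 * np.arange(5 * sample_rate) / sample_rate)).astype(np.int16)
redacted = redact_periods(master, sample_rate, [(1.0, 2.5), (4.0, 4.5)], mode="bleep")
print(master[int(1.5 * sample_rate)], redacted[int(1.5 * sample_rate)])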
FIG. 15 shows a flowchart describing the system and method in accordance with
one or
more embodiments of the invention.
Step 1501 - speech recognition cannot translate the speech to text. This can
happen when the speech recognition does not recognize the word the person is
saying either because the word is not in the selected language's dictionary or
because the audio is incomprehensible.
Step 1502 - marking the area of interest as undetermined for the operator to
manually confirm the automatic blurring of the audio.
Step 1503 - Playback the undetermined audio to the operator.
Step 1504 - The operator confirms or changes the marking of the unspecified
audio either by redacting the audio or by leaving the audio track intact.
FIGS. 16 and 17 show different embodiments of the invention.
FIG. 16 shows a flowchart describing the system and method to automatically
replace
zones of interest with alternate audio from an audio and video recording or
feed.
Wherein alternate audio comprises one or more from the group of blurring, a bleep, silence, white noise, and anonymizing of voices, in accordance with one or more embodiments of the invention.
Step 1601 - matching the audio track of an audio-video recording to a language, wherein the matching can be automatic or manual.
Step 1602 - using speech recognition to identify a phrase of interest. The speech
recognition engine used is the one matching the language. In a different
embodiment of the invention, if more than one language is detected, the speech
recognition is then run more than one time, each with a different language
matching the detected languages within the audio-video recording. Speech
recognition is an interdisciplinary subfield of computational linguistics that
develops methodologies and technologies that enable the recognition and
translation of spoken language into text by computers. It is also known as
automatic speech recognition (ASR), computer speech recognition, or speech to
text (STT). It incorporates knowledge and research in linguistics, computer
science, and electrical engineering fields. Speech recognition has a long
history
with several waves of major innovations. Most recently, the field has
benefited
from advances in deep learning and big data. There has been a worldwide
industry adoption of a variety of deep learning methods in designing and
deploying speech recognition systems.
Continuing with the description of step 1602, one familiar with the art will
appreciate that a phrase of interest may comprise one or more from the group of predetermined phrases from a first list, unidentified words, and combinations of words from a second list. For example, a number before a word is interpreted
as
an address.
In a different embodiment of the invention, a database is created with words
or
phrases of interest, i.e. street names, city names, web addresses, individual
names in one or different languages or slang terms, to name a few. Based on
that database, the voice recognition module takes the identified audio from the speech recognition and compares it with the words or phrases in the
database. If a percentage of confidence is higher or lower than the parameters
stipulated by default or by the user, then, those words or phrases become
candidates for phrases of interest and are processed as described below.
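For illustration, a non-limiting Python sketch of this comparison: a time-stamped transcript (assumed to be the output of the speech recognition engine) is scanned against a small database of phrases of interest, and spans whose confidence crosses the configured threshold become candidates for redaction. The phrase list, the confidence values, and the threshold are illustrative.

# Sketch only: find transcript spans that match phrases of interest and emit
# the time ranges that become candidates for redaction.
PHRASES_OF_INTEREST = {"main street", "elm avenue", "john doe"}
CONFIDENCE_THRESHOLD = 0.75   # user-configurable, as described above

def redaction_candidates(words, phrases=PHRASES_OF_INTEREST,
                         threshold=CONFIDENCE_THRESHOLD):
    """words: list of (text, start_s, end_s, confidence). Returns (start_s, end_s) spans."""
    spans = []
    texts = [w[0].lower() for w in words]
    for phrase in phrases:
        tokens = phrase.split()
        for i in range(len(texts) - len(tokens) + 1):
            if texts[i:i + len(tokens)] == tokens:
                window = words[i:i + len(tokens)]
                if min(w[3] for w in window) >= threshold:
                    spans.append((window[0][1], window[-1][2]))
    return sorted(spans)

transcript = [("he", 0.0, 0.2, 0.95), ("lives", 0.2, 0.5, 0.92),
              ("on", 0.5, 0.6, 0.90), ("main", 0.6, 0.9, 0.88),
              ("street", 0.9, 1.3, 0.91)]
print(redaction_candidates(transcript))   # [(0.6, 1.3)]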
Step 1603 - editing the master audio track by replacing the phrase of interest
with
alternate audio. The editing of the master audio track includes replacing the
words or phrases of interest with alternate audio which includes one or more
from the group of blurring, a bleep, silence, whitenoise, anonymizing of
voices.
For example, the subject of interest says a phrase that is recorded in the
audio
track, which means that the zone of interest for audio may include, for
example,
any audio that contains information that is personally identifiable and
private. For
example, the video may be a recording of an interview during which a person or subject of interest says the name and address of a third party; the name and address of that third party need to be redacted by redacting that portion of the audio.
Step 1604 - output an edited audio and video recording or feed. The final
product
after the redacting of the audio-video or just the audio track itself is a
file that
cannot be reverse-engineered. One familiar with the art will appreciate that
the
final product is a file that only contains the contents of a new audio track
which
cannot be reverse-engineered to figure out the phrases that are redacted. In a
different embodiment of the invention, an editable file can be produced which
also contains the original audio track in the same timeline of the audio-video
file
as the final output file in order for an authorized person to be able to
compare
what has been redacted to what was the original content of the audio-video
file.
One familiar with the art will appreciate that the input file can be an audio
file or
an audio-video file combination and that the output file or final product can
be an
audio file or an audio-video file.
FIG. 17 shows a flowchart describing a different embodiment of the invention.
Step 1701 - matching the audio track of an audio-video recording to a language, wherein the matching can be automatic or manual.
Step 1702 - using speech recognition to identify a phrase of interest within the audio-video track.
Step 1703 - is the result undetermined (area of interest not identified)?
Undetermined audio is an audio segment that the speech recognition module
was not able to identify or translate, either because the word is slang, an adjective, a name of a person, street, city, email address, or web address, an identifiable number such as a street number or phone number, or any other name or
phrase without a meaning in the translation.
If yes, then Step 1704; if no, then Step 1709.
Step 1704 - marking the area of interest as undetermined for the operator to
manually confirm the automatic blurring of the audio. Alternatively creating
an
audio track with all the undetermined audio unredacted so the operator can determine if those audio segments should be redacted or not.
Step 1705 - Playing back all the undetermined audio to an operator. This
option
is made so the operator can identify if the redaction of the audio is correct
or not.
This track plays the
Step 1706 - Is the automatic blurring of the audio correct? If yes, then Step 1707; if no, then Step 1708.
Step 1707 - The operator confirms that the automatic blurring of the audio is correct. Then Step 1709.

Step 1708 - The operator reverses it. Then Step 1709.
Step 1709 - editing the master audio track by replacing the phrase of interest with alternate audio.
Step 1710 - output an edited audio and video recording or feed.
While the invention has been described with respect to a limited number of
embodiments, those skilled in the art, having benefit of this disclosure, will
appreciate
that other embodiments can be devised which do not depart from the scope of
the
invention as disclosed herein. Accordingly, the scope of the invention should
be limited
only by the attached claims.