VIDEO TO DATA
CLAIM OF PRIORITY
This application claims priority to U.S. Patent Application No. 14/175,741,
filed
February 7, 2014, which claims priority to U.S. Provisional Patent
No. 61/866,175,
filed on August 15, 2013, and claims priority to U.S. Provisional Patent
Application No.
62/021,666, filed July 7, 2014.
TECHNICAL FIELD
The present invention relates to a method and a system for generating various
and useful
data from videos.
BACKGROUND
In the field of image contextualization, distributed reverse image similarity
searching can
be used to identify images similar to a target image. Reverse image searching
can find exactly
matching images as well as flipped, cropped, and altered versions of the
target image. Distributed
reverse image similarity searching can be used to identify symbolic similarity
within images.
Audio-to-text algorithms can be used to transcribe text from audio. An
exemplary application is
note-taking software. Audio-to-text, however, lacks semantic and contextual
language
understanding.
SUMMARY
The present invention is generally directed to a method to generate data from
video content,
such as text and/or image-related information. A server executing the method
can be directed by a
program stored on a non-transitory computer-readable medium. The video text
can be, for
example, a context description of the video.
An aspect of the method can include generating text from an image of the
video, converting
audio associated with the video to text, extracting topics from the text
converted from the audio,
cross-referencing the text generated from the image of the video and the
topics extracted from
audio associated with the video, and generating video text based on a result
of the cross-
referencing.
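For illustration only, the following minimal sketch shows one way the cross-referencing step described above could be organized; the helper names in the usage comment (describe_frames, transcribe, extract_topics) are hypothetical placeholders for the image-to-text, audio-to-text, and topic-extraction components, and the scoring is an invented simplification rather than the claimed method.
```python
from collections import Counter

def cross_reference(frame_texts, audio_topics):
    """Combine topics seen in video frames with topics heard in the audio.

    frame_texts : list of keyword lists generated from individual frames
    audio_topics: list of (topic, weight) pairs extracted from the transcript
    """
    image_counts = Counter(word.lower() for text in frame_texts for word in text)
    audio_weights = {topic.lower(): weight for topic, weight in audio_topics}

    scored = {}
    for topic, count in image_counts.items():
        # Topics appearing in both the image text and the audio topics
        # receive additional weight, per the cross-referencing step above.
        boost = 2.0 if topic in audio_weights else 1.0
        scored[topic] = count * boost + audio_weights.get(topic, 0.0)

    # The highest-scoring topics form the generated "video text".
    return [t for t, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:10]]

# Hypothetical usage:
# video_text = cross_reference(describe_frames(video), extract_topics(transcribe(audio)))
```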
In some embodiments, natural language processing can be applied to the
generation of text from an image of the video, converting audio associated
with the video
to text, or both.
In other embodiments, the text from the image of the video can be generated by
identifying context, a symbol, a brand, a feature, an object, and/or a topic
in the image of
the video.
In yet other embodiments, the text from the image can be generated by first
segmenting images of the video, and then converting the segments of images to
text in
parallel. The text from the audio can be generated by first segmenting the audio, and then converting the audio segments to text in parallel. The
audio can be
segmented at spectrum thresholds. The generated text may be of different
sizes. The size
of the text can be adjusted by a ranking or scoring function that, for
example, can adjust
the text size based on confidence in the description or relevance to a search
inquiry. The
text can describe themes, identification of objects or other information of
interest.
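For illustration only, one possible ranking/scoring function that adjusts the displayed text size from confidence and relevance, as mentioned above; the weights, point sizes, and input ranges are arbitrary illustrative choices, not part of the described embodiments.
```python
def scaled_text_size(confidence, relevance, base_pt=12, min_pt=8, max_pt=36):
    """Map a description's confidence and search relevance to a display size.

    Both inputs are assumed to lie in [0, 1]; the 60/40 weighting below is
    illustrative, not prescribed by the specification.
    """
    score = 0.6 * confidence + 0.4 * relevance
    size = base_pt + score * (max_pt - base_pt)
    return max(min_pt, min(max_pt, round(size)))

# e.g. a low-confidence but highly relevant label:
# scaled_text_size(0.3, 0.9)  -> a mid-sized label
```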
In some embodiments, the method can include generating advertising and/or
product or service recommendations based on video content. The video content
can be
text, context, symbols, brands, features, objects, and/or topics related to or
found in the
video. An advertisement and/or product or service recommendations can be
placed at a
specific time in the video based on the video content and/or section symbol of
a video
image. The advertisement and/or product or service recommendations can also be
placed
at a specific time as part of the video player, e.g., side panel, and also may
be placed on a
second screen. In some embodiments, the method can include directing when one
or more
advertisements can be placed in a predetermined context at a preferred time.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is further described in the detailed description which
follows, in reference to the noted plurality of drawings by way of non-
limiting examples
of certain embodiments of the present invention, in which like numerals
represent like
elements throughout the several views of the drawings, and wherein:
FIG. 1 illustrates an embodiment of the present invention.
FIG. 2 illustrates an embodiment of image data processing.
FIG. 3 illustrates an embodiment of audio data processing.
FIG. 4 illustrates another embodiment of the present invention.
FIG. 5 illustrates various exemplary embodiments of the present invention.
FIG. 6 illustrates a flow diagram of an embodiment.
FIG. 7 illustrates an embodiment of the architecture of the present invention.
FIG. 8 illustrates a flow diagram of an embodiment of image recognition.
FIG. 9 illustrates an embodiment of a graphical user interface of the present
invention.
DETAILED DESCRIPTION
A detailed explanation of the system and method according to exemplary
embodiments of the present invention is provided below. Exemplary
embodiments
described, shown, and/or disclosed herein are not intended to limit the
claims, but rather,
are intended to instruct one of ordinary skill in the art as to various
aspects of the
invention. Other embodiments can be practiced and/or implemented without
departing
from the scope and spirit of the claimed invention.
The present invention is generally directed to a system, a device, and a method of
generating content from video files, such as text and information relating to
context,
symbols, brands, features, objects, faces and/or topics found in the images of
such videos.
In an embodiment, the video-to-content engine can perform the functions
directed by
programs stored in a computer-readable medium. That is, the embodiments may
take the
form of a hardware embodiment (including circuits), a software embodiment, or
an
embodiment combining software and hardware. The present invention can take the
form
of a computer-program product that includes computer-useable instructions
embodied on
one or more computer-readable media.
The various video-to-content techniques, methods, and systems described herein
can be implemented in part or in whole using computer-based systems and
methods.
Additionally, computer-based systems and methods can be used to augment or
enhance
the functionality described herein, increase the speed at which the functions
can be
performed, and provide additional features and aspects as a part of or in
addition to those
described elsewhere in this document. Various computer-based systems, methods
and
implementations in accordance with the described technology are presented
below.
A video-to-content engine can be embodied by a general-purpose computer or
a server and can have an internal or external memory for storing data and
programs such
as an operating system (e.g., DOS, Windows 2000™, Windows XP™, Windows NT™,
OS/2, UNIX or Linux) and one or more application programs. Examples of
application
programs include computer programs implementing the techniques described
herein for
lyric and multimedia customization, authoring applications (e.g., word
processing
programs, database programs, spreadsheet programs, or graphics programs)
capable of
generating documents or other electronic content; client applications (e.g.,
an Internet
Service Provider (ISP) client, an e-mail client, or an instant messaging (IM)
client)
capable of communicating with other computer users, accessing various computer
resources, and viewing, creating, or otherwise manipulating electronic
content; and
browser applications (e.g., Microsoft's Internet Explorer) capable of
rendering standard
Internet content and other content formatted according to standard protocols
such as the
Hypertext Transfer Protocol (HTTP). One or more of the application programs
can be
installed on the internal or external storage of the general-purpose computer.
Alternatively, application programs can be externally stored in or performed
by one or
more device(s) external to the general-purpose computer.
The general-purpose computer or server may include a central processing unit
(CPU) for executing instructions in response to commands, and a communication
device
for sending and receiving data. One example of the communication device can be
a
modem. Other examples include a transceiver, a communication card, a satellite
dish, an
antenna, a network adapter, or some other mechanism capable of transmitting
and
receiving data over a communications link through a wired or wireless data
pathway.
The general-purpose computer or server may also include an input/output
interface that enables wired or wireless connection to various peripheral
devices. In one
implementation, a processor-based system of the general-purpose computer can
include a
main memory, preferably random access memory (RAM), and can also include a
secondary memory, which may be a tangible computer-readable medium. The
tangible
computer-readable medium memory can include, for example, a hard disk drive or
a
removable storage drive, a flash based storage system or solid-state drive, a
floppy disk
drive, a magnetic tape drive, an optical disk drive (Blu-Ray, DVD, CD drive),
magnetic
tape, paper tape, punched cards, standalone RAM disks, Iomega Zip drive, etc.
The
removable storage drive can read from or write to a removable storage medium.
A
removable storage medium can include a floppy disk, magnetic tape, optical
disk (Blu-
Ray disc, DVD, CD) a memory card (CompactFlash card, Secure Digital card,
Memory
Stick), paper data storage (punched card, punched tape), etc., which can be
removed from
the storage drive used to perform read and write operations. As will be
appreciated, the
removable storage medium can include computer software or data.
In alternative embodiments, the tangible computer-readable medium memory can
include
other similar means for allowing computer programs or other instructions to be
loaded into a
computer system. Such means can include, for example, a removable storage unit
and an interface.
Examples of such can include a program cartridge and cartridge interface (such as that found in
as the found in
video game devices), a removable memory chip (such as an EPROM or flash
memory) and
associated socket, and other removable storage units and interfaces, which
allow software and data
to be transferred from the removable storage unit to the computer system.
An embodiment of video-to-content engine operation is illustrated in FIG. 1.
At 110, a
video stream is presented. The video stream may be in a format including (but not
limited to): Advanced
Video Codec High Definition (AVCHD), Audio Video Interlaced (AVI), Flash Video
Format
(FLV), Motion Picture Experts Group (MPEG), Windows Media Video (WMV), Apple QuickTime (MOV), or H.264 (MP4).
The engine can extract audio data and image data (e.g. images or frames
forming the video)
from the video stream.
In some embodiments, the video stream and the extracted image data can be
stored in a
memory or storage device such as those discussed above. A copy of the
extracted image data can
be used for processing.
At 120, the video-to-content engine performs an image data processing on the
video stream.
An example of the image data processing is illustrated in FIG. 2. In FIG. 2,
the video data 310 can
be segmented into N segments and processed in parallel (e.g., distributed
processing 320-1 to 320-
N), allowing for near real-time processing.
An example of the video image data processing can be symbol (or object) based.
Using
image processing technique such as color edge detection, a symbol of a screen
or an image of the
video can be isolated. The symbol can be identified using an object template
database. For
example, if the symbol includes four legs and a tail, then when matched against the object template database, the symbol may be identified as a dog. The object template database can be adaptive, and therefore its performance can improve with usage.
therefore, the performance would improve with usage.
Other image data processing techniques may include image extraction, high-
level vision
and symbol detection, figure-ground separation, depth and motion perception.
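For illustration only, a minimal sketch of the edge-detection and template-matching flow described above, assuming OpenCV 4.x, a grayscale frame, and a small in-memory template dictionary; the thresholds and the matching criterion are illustrative choices, not the claimed implementation.
```python
import cv2

def identify_symbol(frame_gray, template_db):
    """Isolate a symbol via edge detection and match it against an object
    template database (a dict mapping name -> grayscale template image)."""
    edges = cv2.Canny(frame_gray, 100, 200)                 # edge map of the frame
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    symbol = frame_gray[y:y + h, x:x + w]                   # isolated symbol region

    best_name, best_score = None, -1.0
    for name, template in template_db.items():
        resized = cv2.resize(symbol, (template.shape[1], template.shape[0]))
        score = cv2.matchTemplate(resized, template, cv2.TM_CCOEFF_NORMED).max()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > 0.5 else None          # e.g. "dog"
```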
Another example of video image processing can be color segmentation. The
colors of an image (e.g., a screen) of the video can be segmented or grouped.
The result
can be compared to a database using color similarity matching.
Based on the identified symbol, a plurality of instances of the symbol can be
compared to a topic database to identify a topic (such as an event). For
example, the result
may identify the dog (symbol) as running or jumping. The topic database can be
adaptive
to improve its performance with usage.
Thus, using the processing example above, text describing a symbol of the
video
and a topic relating to the symbol may be generated, as is illustrated in FIG.
9. Data
generated from an image and/or from audio transcription can be time stamped,
for
example, according to when it appeared, was heard, and/or according to the
video frame
from which it was pulled.
At 330, the engine combines the topics as an array of keys and values with
respect
to the segments. The engine can segment the topics over a period of time and
weight the
strength of each topic. Further, the engine applies the topical meta-data to
the original full
video. The image topics can be stored as topics for the entire video or each
image
segment. The topic generation process can be repeated for all identifiable
symbols in a
video in a distributed process. The outcome would be several topical
descriptors of the
content within a video. An example of the aggregate information that would be
derived
using the above example would be understanding that the video presented a dog,
which
was jumping, on the beach, with people, by a resort.
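For illustration only, a minimal sketch of combining per-segment topics into an array of keys and values for the whole video, as described above; equal weighting of segments is an illustrative simplification, since the engine may weight segments differently.
```python
from collections import defaultdict

def aggregate_segment_topics(segment_topics):
    """Merge per-segment topic dictionaries into topics for the entire video.

    segment_topics: list of dicts, one per video segment, each mapping
                    topic -> weight (e.g. detection confidence).
    """
    totals = defaultdict(float)
    for topics in segment_topics:
        for topic, weight in topics.items():
            totals[topic] += weight
    # Highest-weight topics first, as topical descriptors of the video.
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

# e.g. [{"dog": 0.9, "beach": 0.6}, {"dog": 0.8, "resort": 0.5}]
# -> {"dog": 1.7, "beach": 0.6, "resort": 0.5}
```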
Identifying various objects in an image can be a difficult task. For example,
locating (segmenting) and positively identifying an object in a given frame or
image can
yield false positives, that is, locating but wrongfully identifying an object. Therefore, present
embodiments can be utilized to eliminate false positives, for example, by
using context.
As one example, if the audio soundtrack of a video is an announcer calling a
football
game, then identification of ball in a given frame as a basketball can be
assigned a
reduced probability or weighting. As another example of using context, if a
given series
of image frames from a video is positively or strongly identified as a horse
race, then
identifying an object to be a mule or donkey can be given a reduced weight.
Using the context or arrangement of certain objects in a given still or static
image
to aid in computer visual recognition accuracy can be an extremely difficult
task given
certain challenges associated with partially visible or self-occluded objects,
lack of
objects, and/or faces, and/or words or an overly cluttered image, etc.
However, the linear
sequencing of frames from a video, as opposed to a stand-alone image, avails itself to a set of images {images x-y} from which context can be derived. This contextual
methodology can be viewed as systematic detection of probable image false
positives by
identifying an object from one video frame (or image) as an anomaly when
compared to
and associated with a series of image frames both prior and subsequent to the
purported
anomaly. According to the objects, faces, words, etc. of a given set of frames
(however
defined), a probability can be associated with an identified anomaly to
determine whether
an image may be a false positive and, if so, what other likely results should
be.
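For illustration only, a minimal sketch of the contextual re-weighting described above; the conflict table, penalty factor, and labels are invented for the example and would in practice be derived from the surrounding frames and transcript.
```python
def reweight_with_context(detections, context_topics, penalty=0.3):
    """Down-weight detections that conflict with the surrounding context.

    detections:     list of (label, probability) pairs for one frame
    context_topics: labels strongly supported by neighboring frames and/or
                    the audio transcript (e.g. {"football", "stadium"})
    """
    conflicts = {"basketball": {"football"}, "donkey": {"horse race"}}
    adjusted = []
    for label, prob in detections:
        if conflicts.get(label, set()) & context_topics:
            prob *= penalty          # likely false positive given the context
        adjusted.append((label, prob))
    return sorted(adjusted, key=lambda lp: -lp[1])

# reweight_with_context([("basketball", 0.7), ("football", 0.6)], {"football"})
# would rank "football" first.
```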
In certain instances, identification of an individual can be a difficult task.
For
example, facial recognition can become difficult when an individual's face is
obstructed
by another object like a football, a baseball helmet, a musical instrument, or
other
obstructions. An advantage of some embodiments described herein can include
the ability
to identify an individual without identification of the individual's face.
Embodiments can
use contextual information such as associations of objects, text, and/or other
context
within an image or video. As one example, a football player scores a touchdown
but
rather than identifying the player using facial recognition, the player can be
identified by
object recognition of, for example, the player's team's logo, text recognition
of the
player's jersey number, and by cross referencing this data with that team's
roster (as
opposed to another team, which is an example of why the logo recognition can be
important). Such embodiments can further learn to identify that player more
readily and
save his image as data.
Similarly, the audio transcript of a video can be used to derive certain
context
helpful in identifying and correcting or eliminating image false positives. In
this way, an
image anomaly or anomalies identified in a given video frame(s) are associated
with time
(time stamped) and correlated with a time range from the transcribed audio to
establish
certain probabilities of accuracy.
Moreover, the aforementioned methodologies (establishing probabilities of accuracy of image identification from a set of frames and from the audio transcription) can be combined to improve the results.
In some embodiments, a similar context methodology can be used to identify
unknown objects in a given image by narrowing a large, or practically
infinite, number of
possibilities to a relatively small number of object possibilities and
assigning
probabilities. For example, neuro-linguistic programming (NLP), neural network
programming, or deep neural networks can be utilized to achieve sufficient
narrowing and
weighting. For further example, based on a contextual review of a large number
of objects
over a period of time, a series of nodes in parallel and/or in series can be
developed by the
processor. Upon initial recognition of objects and context, these nodes can assign probabilities to the initial identification of the object, with each node in
turn using
context and further description to narrow the probabilistic choices of an
object. Other
methodologies can be utilized to determine and/or utilize context as described
herein.
Natural language processing can be useful in creating an intuitive and/or user-
friendly computer-human interaction. In some embodiments, the system can
select
semantics or topics, following certain rules, from a plurality of possible
semantics or
topics, can give them weight based on strength of context, and/or can do this in a distributed
environment. The natural language processing can be augmented and/or improved
by
implementing machine-learning. A large training set of data can be obtained
from
proprietary or publicly available resources. For example, CBS News
maintains a database
of segments and episodes of "60-Minutes" with full transcripts, which can be
useful for
building a training set and for unattended verification of audio segmentation.
The
machine learning can include ensemble learning based on the concatenation of
several
classifiers, i.e. cascade classifiers.
At 130, an optional step of natural language processing can be applied to the
image text. For example, based on dictionary, grammar, and a knowledge
database, the
text extracted from video images can be modified as the video-to-content
engine selects
primary semantics from a plurality of possible semantics. In some embodiments,
the
system and method can incorporate a Fourier transform of the audio signal.
Such frequency-domain analysis can improve silence recognition, which can be useful for determining proper
placement of
commas and periods in the text file.
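For illustration only, a minimal sketch of Fourier-based silence recognition over short frames; the frame length and energy threshold are arbitrary values, and the returned spans would serve only as hints for comma and period placement in the text file.
```python
import numpy as np

def silent_spans(samples, rate, frame_ms=30, threshold=1e-4):
    """Return (start_s, end_s) spans whose spectral energy falls below a
    threshold; such quiet spans can hint at punctuation placement."""
    frame_len = int(rate * frame_ms / 1000)
    quiet, spans, start = [], [], None
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(np.abs(np.fft.rfft(frame)) ** 2) / frame_len
        quiet.append(energy < threshold)
    for i, q in enumerate(quiet + [False]):      # sentinel closes an open span
        if q and start is None:
            start = i
        elif not q and start is not None:
            spans.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    return spans
```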
In parallel, at 140, the video-to-content engine can perform audio-to-text
processing on audio data associated with the video. For example, for a movie
video, the
associated audio may be the dialog or even background music.
In addition to filtering of the audio signal, images from the video signal can
be
processed to address, for example, the problem of object noise in a given
frame or image.
Often images are segmented only to locate and positively identify one or very
few main
images in the foreground of a given frame. The non-primary or background
images are
often treated as noise. Nevertheless, these can provide useful information, context, and/or branding, to give two examples. To fine-tune the amount of object noise cluttering
a data set,
it can be useful to provide a user with an option to dial image detection
sensitivity. For
certain specific embodiments, identification of only certain clearly
identifiable faces or
large unobstructed objects or brand logos can be required, with all other image
noise
disregarded or filtered, which can require less computational processing and
image
database referencing, in turn reducing costs. However, it may become necessary
or
desirable to detect more detail from a frame or set of frames. In such
circumstances, the
computational thresholds for identification of an object, face, etc. can be
altered
according to a then stated need or desire for non-primary, background,
obstructed and/or
grainy type images. Such image identification threshold adjustment capability
can be
implemented, for example, as user-controlled interface, dial, slider, or
button, which
enables the user to make adjustments to suit specific needs or preferences.
An example of the audio data processing is illustrated in FIG. 3. In FIG. 3,
the
audio data 410 can be segmented into N segments and processed in parallel
(e.g.,
distributed processing 420-1 to 420-N), allowing for near real-time
processing.
In some embodiments, the segmentation can be performed by a fixed period of
time. In another example, quiet periods in the audio data can be detected, and
the
segmentation can be defined by the quiet periods. For example, the audio data
can be
processed and converted into a spectrum. Locations where the spectrum
volatility is
below a threshold can be detected and segmented. Such locations can represent
silence or
low audio activities in the audio data. The quiet periods in the audio data
can be ignored,
and the processing requirements thereof can be reduced.
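For illustration only, a minimal sketch of choosing audio split points where spectral volatility falls below a threshold, as described above; the frame length and threshold are arbitrary illustrative values.
```python
import numpy as np

def split_points(samples, rate, frame_ms=50, volatility_threshold=0.05):
    """Suggest sample indices where the spectrum changes very little, i.e.
    candidate quiet periods at which the audio can be segmented."""
    frame_len = int(rate * frame_ms / 1000)
    spectra = []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        mag = np.abs(np.fft.rfft(frame))
        spectra.append(mag / (np.linalg.norm(mag) + 1e-12))  # normalized spectrum
    points = []
    for i in range(1, len(spectra)):
        volatility = np.linalg.norm(spectra[i] - spectra[i - 1])
        if volatility < volatility_threshold:
            points.append(i * frame_len)      # candidate split (sample index)
    return points

# Segments bounded by these points can then be transcribed in parallel,
# and the quiet frames themselves can be skipped to reduce processing.
```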
Audio data and/or segments of audio data can be stored in, for example, the memory or storage devices discussed above. Copies of the audio segments can be sent to
audio
processing.
The audio data for each segment can be translated into text in parallel, for
example through distributed computing, which can reduce processing time.
Various audio
analysis tools and processes can be used, such as audio feature detection and
extraction,
audio indexing, hashing and searching, semantic analysis, and synthesis.
At 430, text for a plurality of segments can then be combined. The combination
can result in segmented transcripts and/or a full transcript of the audio
data. In an
embodiment, the topics in each segment can be extracted. When combined, the
topics in
each segment can be given a different weight.
The audio topics can be stored as topics for the entire video or each audio
segment.
At 150, an optional step of natural language processing can be applied to the
text.
For example, based on dictionary, grammar, and/or a knowledge database, the
text extracted
from the audio stream of a video can be given context, an applied sentiment,
and topical
weightings.
At 160, the topics generated from an image or a frame and the topics extracted
from audio can be combined. The text can be cross-referenced, and topics
common to
both texts would be given additional weights. At 170, the video-to-content
engine
generates video text, such as text describing the content of the video, using
the result of
the combined texts and cross-reference. For example, keywords indicating topic and semantics that appear in both texts can be selected or emphasized. The output
can also
include metadata that can be time-stamped with frame references. The metadata
can
include the number of frames, the range of frames, and/or timestamp
references.
FIG. 4 illustrates another embodiment of the present invention. User equipment
(UE) 210 can communicate with a server or servers 220 via a network 230. An
exemplary
embodiment of the system can be implemented over a cloud computing network.
For exemplary purposes only, and not to limit one or more embodiments herein,
FIG. 6 illustrates a flow diagram of an embodiment. A video file is first
split into video
data and audio data. A data pipeline, indicated in the figure as Video
Input/Output, can
extract sequences of image frames and can warehouse compressed images in a
distributed
data store as image frame data. A distributed computation engine can be
dedicated to
image pre-processing, performing e.g. corner and/or edge detection and/or
image
segmentation. The engine can also be dedicated to pattern recognition, e.g.
face detection
and/or logo recognition, and/or other analysis, such as motion tracking.
Processed data
can be sent to one or more machines that can combine and/or sort results in a
time-
ordered fashion. Similarly, the Audio Input/Output represents a data pipeline
for e.g.
audio analysis, compression, and/or warehousing in a distributed file system.
The audio
can be, for example but not limited to, WAV, MP3, or other known formats. Also
similarly
to the video branch, a distributed computation engine can be dedicated to
audio pre-
processing, e.g. noise removal and/or volume adjustment, pattern recognition,
e.g.
transcription and/or keyword detection, and/or other analysis, e.g.
identifying unique
speakers. Processed audio data can be sent to one or more machines that
reassemble
transcript segments in their correct time-order. A time-stamped transcript can
be sent
through an NLP, or other preferred system or analysis, which can transform the
data in
time-ordered topics and/or subject matter. Both branches converge to output
data from
parallel video and audio pipelines. The output data can be synced into one or
more
machines that can combine image and audio generated topics and/or tags which
can be
applied towards a number of user experiences or user-defined outputs. Such
experiences
can include search engine optimization, video categorization, recommendation
engines,
advertisement targeting, content personalization, analytics, etc. The output
can include
metadata that is time-stamped with frame references. The metadata can include
the
number of frames, the range of frames, and/or timestamp references.
The UE 210 can include, for example, a laptop, a tablet, a mobile phone, a
personal digital assistant (PDA), a keyboard, a display monitor with or
without a touch
screen input, and an audiovisual input device. In another implementation, the
peripheral
devices may themselves include the functionality of the general-purpose
computer. For
example, the mobile phone or the PDA may include computing and networking
capabilities and function as a general purpose computer by accessing a network
and
communicating with other computer systems.
The server 220 can include the general purpose computer discussed above.
The network 230 includes, for example, the Internet, the World Wide Web,
WANs, LANs, analog or digital wired and wireless telephone networks (e.g.,
Public
Switched Telephone Network (PSTN), Integrated Services Digital Network (ISDN),
and
Digital Subscriber Line (xDSL)), radio, television, cable, or satellite
systems, and other
delivery mechanisms for carrying data. A communications link can include
communication pathways that enable communications through one or more
networks.
In some embodiments, a video-to-content engine can be embodied in a server or
servers 220. The UE 210, for example, requests an application relating to the
video
stream. The servers 220 perform the audio-to-text process on the segmented
audio in
parallel. The distributed audio-to-text processing reduces the overall
response time. This
method allows real-time audio-to-text conversion.
The UE 210 communicates with the server 220 via the network 230 for video
stream application. The video-to-content engine can generate the video text as
illustrated
in FIG. 1. The server 220 then generates an advertisement (text, images, or
animation) based
on the video text. In some embodiments, the server adds the advertisement to a
specific
symbol, image, frame, or a specific time in the video stream. The specific
symbol, image,
frame, or the specific time in the video stream can be selected based on the
video text.
The server 220 can add the audio text to the video stream in real time (i.e., real-time closed captioning).
The server 220 can generate video recommendations based on a database of the
video text. In some embodiments, the server 220 can search videos based on the
video
text (e.g., via a database of video text). In this fashion, video search can
be optimized.
Applications for the video search optimization may include search engine optimization (SEO), search engine marketing (SEM), censorship, and removal of materials in violation of copyright.
The video streams can be videos viewed by a user, and the server 220 generates
a
preference profile for the user using the video data.
In an embodiment, as shown in FIG. 5 for example, a server node can fetch a
video file. For example, a URL can be used to fetch the video file from an Internet source such as YouTube, and from such URL the video can be scraped. The server can divide the
video
into chunks of smaller data files for processing on several nodes of a cluster
in parallel.
For example, the video file can be separated into audio files and image frame
files. Each
of the types of files can be normalized.
The normalized audio files can be split into constituent files for processing
and
reduction in parallel by various nodes. Various reduction processes can be
performed on
the constituent audio files such as phoneme detection and assembly as well as
grammar
assembly. An output of the audio processing steps can be an extracted text
map.
The normalized image frame files can be processed in order to extract various
data
maps, such as a text map, a tag map, a brand map, an object map, a feature map,
and/or a
tracking map. Such maps can be achieved through various extraction steps. For
example,
the normalized image frame files can be analyzed for text identification
and/or by optical
character recognition. The data can be improved through a dictionary
verification step.
Various maps can be created based on edge detection and/or image segmentation
techniques. Such techniques can be improved by focusing on regions of
interest, for
example based on brands, logos, objects, and/or features of interest.
Additionally, or
alternatively, pixel gradients of the normalized image frame files can be
analyzed and/or
the files can be segmented by temporal and/or spatial components, and thus,
for example,
allow extraction of motion within the video images, which in turn can be used
for
tracking.
Identification of motion or action in a still image can be a challenge in the
vision
science field. However, the linear sequencing of frames from a video, as opposed to a stand-alone image, avails itself to motion detection. A series of sequential
frames can be
analyzed in groups to identify actions, rather than merely objects, or, as a
manifestation of
data, verbs rather than nouns. For example, an object found across several
frames can be
identified and the object's motion can be determined with a high degree of
accuracy. For
further example, a processor can analyze a collection of x sequential frames
to identify a
basketball found in each frame. The processor then can analyze the motion of
the
basketball to determine that a basket was made by slam-dunking. Or, a certain
automobile
may be identified in frame a, with a crashed automobile identified in frame z,
with the
sequential frames in between a and z identifying the action of said car
crashing. The
accuracy of the action-identification can be improved by utilizing contextual
recognition
methodologies discussed herein. For example, the probability of positively
identifying a
slam-dunk action can be increased if Michael Jordan is identified in the
context of the
video and/or images. Action identification can be further improved, in
addition to
modeling context of objects, by analyzing human poses, for example by building
a
learning set of still images capturing known actions.
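For illustration only, a highly simplified sketch of inferring an action ("verb") from an object's positions across sequential frames; the pixel thresholds and action labels are invented and would in practice be learned, for example from the training set of labeled still images mentioned above.
```python
def infer_action(track, rim_y=50, ground_y=400):
    """Guess a coarse action label from an object's center positions over
    sequential frames (track = list of (x, y) pixel coordinates)."""
    if len(track) < 2:
        return "unknown"
    dy = track[-1][1] - track[0][1]         # image y grows downward
    min_y = min(y for _, y in track)
    if min_y <= rim_y and dy > 0:
        return "shot made"                  # ball rose to rim height, then fell
    if dy > (ground_y - rim_y) * 0.5:
        return "falling"
    return "moving"

# infer_action([(100, 380), (120, 45), (140, 390)]) -> "shot made"
```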
In an embodiment, as shown in FIG. 7 for example, to generate metadata,
several
sub-systems may operate on the video. An image recognition sub-system 750 can
take
frames from the video, and isolate and identify objects present in the frame.
An audio
recognition sub-system 760 can include automatic speech recognition, sound
identification and music identification. A natural language processing (NLP)
sub-system
770 can annotate and assign meaning to keywords that are generated by the
image and
audio sub-systems.
The automatic speech recognition (ASR) model can be a set of statistical
structures and operation used for determining words from expected audio
signals. The
ASR model can consist of an acoustic model (AM) and a language model (LM)
between
which there is near perfect overlap. The acoustic model can map audio speech
features to
sounds/word-parts. For example, a series of features might be mapped to the ah
sound in
'bath.' The language model can consist of a dictionary of known words and
their
phonetic mappings to sounds/word-parts and a statistical weighting of
likelihood of a
given word occurring given the previous one or two words. Speech may contain
words
and phrases not commonly used in "regular" language, e.g., double dribble,
free throw,
slap shot, high-sticking, etc. Accordingly, the language model can also
consist of a topic-
specific dictionary, with associated phonetic variants for each word, and may
also consist
of a statistical N-gram model of word probabilities, e.g., "slam dunk" is a
common phrase
but will be used more frequently in sports reporting than in general language.
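For illustration only, a toy sketch of how a bigram language model could give extra weight to topic-specific phrases such as "slam dunk"; the class name, boost factor, seed bigrams, and crude smoothing are all invented for this example, and the boosted values are unnormalized scores rather than true probabilities.
```python
from collections import defaultdict

class BigramLM:
    """Toy bigram model with a topic-specific boost for domain phrases."""

    def __init__(self, boost=5.0,
                 domain_bigrams=(("slam", "dunk"), ("free", "throw"))):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.boost = boost
        self.domain = set(domain_bigrams)

    def train(self, sentences):
        for sentence in sentences:
            words = sentence.lower().split()
            for prev, word in zip(words, words[1:]):
                self.counts[prev][word] += 1

    def score(self, prev, word):
        total = sum(self.counts[prev].values()) or 1
        p = (self.counts[prev][word] + 1) / (total + 1)   # crude smoothing
        if (prev, word) in self.domain:
            p *= self.boost        # favor domain phrases in sports audio
        return p
```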
The acoustic model can process audio waveforms to generate a series of speech
features based on Mel-Frequency Cepstral Coefficients (MFCCs) that are
generated using
a series of signal processing techniques including pre-emphasis, cosine or
hamming
windowing, FFT, Mel-filtering, log power spectrum, DCT and the computation of
delta
and delta-delta coefficients. The automatic speech recognition can encompass a
wide
variety of tasks, e.g., connected digit recognition, dialogue systems, large
vocabulary
continuous speech recognition (LVCSR). Automatic speech recognition can work
in part
by having a statistical understanding of what words can follow other words.
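For illustration only, a minimal sketch of the MFCC, delta, and delta-delta feature extraction outlined above, assuming the open-source librosa library; librosa internally applies the windowing, FFT, Mel-filtering, log power, and DCT steps, and the sample rate and coefficient count here are common defaults rather than values required by the acoustic model.
```python
import librosa
import numpy as np

def speech_features(path, sr=16000, n_mfcc=13):
    """Compute MFCCs plus delta and delta-delta coefficients for an audio file."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2])   # shape: (3 * n_mfcc, n_frames)
```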
The following operational sub-systems (710, 720, 730, 740, 780, 790, 795) can be
be
utilized to support the metadata generation process. Web application servers
710 can
provide clients with the ability to use the service provided by the system,
e.g., upload,
monitor progress and receive outcomes. A video processing sub-system 720 can
transform the video file into data on which the metadata generating sub-
systems operate.
An auditing/coordination sub-system 730 can monitor the overall system
performance,
and can generate operational and business analytic data. An operational data
storage sub-
system 740 can store the generated metadata as well as operational and
business analytic
data for use in active, online processes. A search sub-system 780 can index
client results,
and can make them searchable via the web application. An offline data storage
system
795 can hold the history of all operations performed in the system including
business and
operational data. An extract-transform-load (ETL) subsystem 790 can regularly write to the offline data storage sub-system.
An architecture based on distributed message queuing and distributed data
storage
700 may be utilized to build a scalable system, to optimally allocate
resources for
performance, to enhance failure and overload resiliency. A distributed message
queuing
system may produce data that gets delivered to a particular queue at which
time it gets
consumed by a component that watches that queue. The distributed queuing
system can
be removed.
For exemplary purposes only, and not to limit one or more embodiments herein,
FIG. 8 illustrates a flow diagram of an embodiment of image recognition. The
images can
be classified as faces and objects, and typically an image can contain faces
and objects.
Image recognition can include two components: image detection 800 and image
recognition 810. Image detection can be utilized to determine if there is a
pattern or
patterns in an image that meet the criteria of a face, image, or text. If the
result is
positive, the detection processing then moves to recognition. All fractal
computations can
occur in recognition. Recognition processing can include creating a fractal
representation
of the face or object that was detected, performing a match to an existing
database of
faces and objects, and assigning a value (name) to the face or object and then
returning to
the requesting program.
The system can utilize facial recognition algorithms to identify facial
fractals by
extracting landmarks from an image of the subject's face. For example, the
algorithm
may analyze the relative position, size, and/or shape of the eyes, nose,
cheekbones, and
jaw. These features can then be used to search for other images with matching
features.
Other algorithms can normalize a gallery of face images and then compress the
face data,
only saving the fractal data in the image that is useful for face recognition.
A probe
image can then be compared with the face data. Recognition algorithms can be
divided
into two main approaches, geometric, which looks at distinguishing features,
or
photometric, which is a statistical approach that distills an image into
values and
compares the values with templates to eliminate variances.
The recognition algorithms may include Principal Component Analysis using eigenfaces, Linear Discriminant Analysis, Elastic Bunch Graph Matching using the Fisherface algorithm, the Hidden Markov model, Multi-linear Subspace Learning using tensor representation, and neuronally motivated dynamic link matching.
A hybrid
using fractal genesis can be constructed to detect the face with elements
described above.
Three-dimensional face recognition can also be used. This technique can use 3D
sensors to capture fractal information about the shape of a face. This
information can
then be used to identify distinctive features on the surface of a face, such
as the contour of
the eye sockets, nose, and chin.
One advantage of 3D facial recognition is that it is not affected by changes
in
lighting like other techniques. It can also identify a face from a range of
viewing angles,
including a profile view. Three-dimensional data points from a face vastly
improve the
precision of facial recognition.
To improve the accuracy of detection, the hybrid can also use the visual
details of
the skin, as captured in standard digital or scanned images. This technique,
called skin
texture analysis, barns the unique lines, patterns, and spots apparent in a
person's skin into
a mathematical fractal space. Tests have shown that with the addition of skin
texture
analysis, performance in recognizing faces can increase 20 to 25 percent.
The following recognition models may be utilized:
PCA: Derived from the Karhunen-Loève transformation. Given an s-dimensional
vector
representation of each face in a training set of images, Principal Component
Analysis (PCA) tends to find a t-dimensional subspace whose basis vectors
correspond to the maximum variance direction in the original image space. This
new subspace is normally lower dimensional (t<<s). If the image elements are
considered as random variables, the PCA basis vectors are defined as
eigenvectors
of the scatter matrix (see the illustrative sketch following this list).
Linear Discriminant Analysis (LDA): finds the vectors in the underlying space
that best
discriminate among classes. For all samples of all classes the between-class
scatter matrix SB and the within-class scatter matrix SW are defined. The goal
is
to maximize SB while minimizing SW, in other words, maximize the ratio det|SB| / det|SW|. This ratio is maximized when the column vectors of the projection matrix are the eigenvectors of (SW^-1 × SB).
An eigenspace-based adaptive approach: searches for the best set of projection
axes in
order to maximize a fitness function, measuring at the same time the
classification
accuracy and generalization ability of the system. Because the dimension of
the
solution space of this problem is too big, it is solved using a specific kind
of
genetic algorithm called Evolutionary Pursuit (EP).
Elastic Bunch Graph Matching (EBGM): All human faces share a similar
topological
structure. Faces are represented as graphs, with nodes positioned at fiducial points (eyes, nose, ...) and edges labeled with 2-D distance vectors. Each node
contains a
set of 40 complex Gabor wavelet coefficients at different scales and
orientations
(phase, amplitude). They are called "jets". Recognition is based on labeled
graphs. A labeled graph is a set of nodes connected by edges, nodes are
labeled
with jets, edges are labeled with distances.
Kernel Methods: The face manifold in subspace need not be linear. Kernel
methods are a
generalization of linear methods. Direct non-linear manifold schemes are
explored to learn this non-linear manifold.
Trace transform: A generalization of the Radon transform, is a new tool for
image
processing which can be used for recognizing objects under transformations,
e.g.
rotation, translation and scaling. To produce the Trace transform one computes
a
functional along tracing lines of an image. Different Trace transforms can be produced from an image using different trace functionals.
3-D Morphable Model: Human face is a surface lying in the 3-D space
intrinsically.
Therefore the 3-D model should be better for representing faces, especially to
handle facial variations, such as pose, illumination, etc. Blanz et al.
proposed a
method based on a 3-D morphable face model that encodes shape and texture in
terms of model parameters, and an algorithm that recovers these parameters from a
single image of a face.
Bayesian Framework: A probabilistic similarity measure based on Bayesian
belief that
the image intensity differences are characteristic of typical variations in
appearance of an individual. Two classes of facial image variations are
defined:
intrapersonal variations and extrapersonal variations. Similarity among faces is measured using the Bayesian rule.
Hidden Markov Models (HMM): These are a set of statistical models used to
characterize
the statistical properties of a signal. HMM consists of two interrelated
processes:
(1) an underlying, unobservable Markov chain with a finite number of states, a
state transition probability matrix and an initial state probability
distribution and
(2) a set of probability density functions associated with each state.
Fractal Genesis (Hybrid): The image intensity differences are characteristic of typical
typical
variations in appearance of an individual. Human face is a surface lying in
the
Fractal space intrinsically. Since most of the parameters are self-similar,
the
Fractal model should be better for representing faces, especially to handle
facial
variations, such as pose, illumination, etc. A hybrid, wherein many components of other algorithms are integrated to form a fractal genesis.
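For illustration only, a minimal sketch of the eigenface-style PCA approach listed above using plain NumPy; the "snapshot" Gram-matrix trick, the number of retained components, and the nearest-neighbor matching suggestion are conventional choices and are not prescribed by the models described in this list.
```python
import numpy as np

def eigenfaces(face_matrix, t=20):
    """PCA on a training set of faces (one flattened image per row); returns
    the mean face and the top-t basis vectors (eigenfaces) of the scatter matrix."""
    mean = face_matrix.mean(axis=0)
    centered = face_matrix - mean
    # Eigen-decomposition of the small Gram matrix avoids forming the full
    # pixel-by-pixel scatter matrix (the standard "snapshot" trick).
    gram = centered @ centered.T
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1][:t]
    basis = centered.T @ vecs[:, order]
    basis /= np.linalg.norm(basis, axis=0, keepdims=True)
    return mean, basis                        # basis shape: (n_pixels, t)

def project(face, mean, basis):
    """Project a flattened face into the t-dimensional eigenface subspace;
    nearest-neighbor search in this space gives a simple recognizer."""
    return basis.T @ (face - mean)
```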
An advantage of present embodiments includes the ability to provide real-time
or faster-
than-real-time content output. This can be achieved through one or more
components and/or steps.
For example, a video file can be distributed across at least two layers for
processing. The audio
can be converted to text on at least one layer, and the images can be
processed on at least one other
layer. In some embodiments, natural language processing can abstract topics,
sentiments, temporal
topic-tagging, and can be used for further optimization and/or machine
learning. The layers can
include node clusters for parallel processing chunks of the video file into
the preferred content. In
some embodiments, the files can be maintained and processed in parallel at
each step, and then
combined into a single data file as one of the terminal processing steps.
Present embodiments have wide application. For example, video indexing,
reverse image
lookup, video co-groupings and graph searches, and video similarity indexing,
as described herein,
can be used for searching, for classification, and for recommendations
regarding processed videos.
Law enforcement and security industries can implement embodiments for object
recognition and
motion detection. Media, entertainment, and industrial entities can implement
embodiments to
monitor for trademark infringement, captioning, advertising and targeting,
brand and product
monitoring and data collection, and marketing analytics. These exemplary implementations are not intended to be limiting, but merely exemplary.
In addition or as an alternative to actively fetching and scraping a video, the
system and
method can be automated as a push system and/or a web crawling system. For
example, the server
can monitor online content of specific providers, such as YouTube, Vimeo, the
growing myriad of
video-content creating websites, or other online video providers. Monitoring
of published videos
can be tailored to search for extracted data relevant to specific requesters.
For example, a purveyor
of certain products can be apprised in real-time of new content relevant to
the products. Such
relevant content can include the context in which the products are found in
the video, the
appearance of competing products, verification of product placement, and other
useful
information.
All of the methods disclosed and claimed herein can be made and executed
without
undue experimentation in light of the present disclosure. While the apparatus
and methods of this
invention have been described in terms of preferred embodiments, it will be
apparent to those of
skill in the art that variations may be applied to the methods and in the
steps or in the sequence of
steps of the method described herein without departing
from the concept, spirit and scope of the invention. In addition, from the
foregoing it will
be seen that this invention is one well adapted to attain all the ends and
objects set forth
above, together with other advantages. It will be understood that certain
features and sub-
combinations are of utility and may be employed without reference to other
features and
sub-combinations. This is contemplated and within the scope of the appended
claims. All
such similar substitutes and modifications apparent to those skilled in the
art are deemed
to be within the spirit and scope of the invention as defined by the appended
claims.