Patent 3020591 Summary

(12) Patent:	(11) CA 3020591
(54) English Title:	A SYSTEM AND METHOD FOR NETWORK INCIDENT IDENTIFICATION, CONGESTION DETECTION, ANALYSIS, AND MANAGEMENT
(54) French Title:	SYSTEME ET PROCEDE D'IDENTIFICATION D'INCIDENT DE RESEAU, AINSI QUE DE DETECTION, D'ANALYSE ET DE GESTION D'ENCOMBREMENT
Status:	Granted

Bibliographic Data

(51) International Patent Classification (IPC):	H04L 41/0631 (2022.01) H04L 41/0659 (2022.01) H04L 41/142 (2022.01) H04L 43/16 (2022.01) H04L 12/24 (2006.01) H04L 12/26 (2006.01)
(72) Inventors :	ZAFER, MURTAZA (United States of America) SRINIVAS, ANAND (United States of America) HOSSAIN, S M S (United States of America) CHANDRASEKARAN, BALACHANDER (United States of America)
(73) Owners :	VMWARE LLC (United States of America)
(71) Applicants :	NYANSA, INC. (United States of America)
(74) Agent:	RICHES, MCKENZIE & HERBERT LLP
(74) Associate agent:
(45) Issued:	2024-02-20
(86) PCT Filing Date:	2017-04-18
(87) Open to Public Inspection:	2017-10-26
Examination requested:	2020-04-27
Availability of licence:	N/A
(25) Language of filing:	English

Patent Cooperation Treaty (PCT):	Yes
(86) PCT Filing Number:	PCT/US2017/028173
(87) International Publication Number:	WO2017/184627
(85) National Entry:	2018-10-10

(30) Application Priority Data:

Application No.	Country/Territory	Date
15/132,049	United States of America	2016-04-18
15/132,051	United States of America	2016-04-18
15/132,057	United States of America	2016-04-18

Abstracts

English Abstract

A system and method for automatic detection of a network incident from real-time network data is disclosed. The method includes: collecting real-time network data; executing performance calculations on the real-time network data to compute performance metrics; and detecting a pattern over a time window, wherein detecting a pattern includes detecting a proportion of metric values crossing a threshold exceeding a defined percentage amount, detecting a presence of a sequence of metric values, detecting a time-ordered stretch of metric values with a length of the time-ordered stretch exceeding a defined threshold, detecting a cyclical presence of a sequence of metric values, or combinations thereof.

French Abstract

L'invention concerne un système et un procédé de détection automatique d'un incident de réseau à partir de données de réseau en temps réel. Le procédé comprend : la collecte de données de réseau en temps réel ; l'exécution de calculs des performances sur les données de réseau en temps réel afin de calculer des métriques des performances ; et la détection d'un motif sur une fenêtre temporelle, la détection d'un motif comprenant la détection d'une proportion de valeurs métriques dépassant un seuil excédant une quantité de pourcentage définie, la détection d'une présence d'une séquence de valeurs métriques, la détection d'un étirement ordonné dans le temps de valeurs métriques avec une longueur de l'étirement ordonné dans le temps excédant un seuil défini, la détection d'une présence cyclique d'une séquence de valeurs métriques, ou leurs combinaisons.

Claims

Note: Claims are shown in the official language in which they were submitted.

We Claim:
1. A method for automated detection of a congestion incident from real-time
network data in
a network, the method comprising:
collecting real-time network data from a plurality of network infrastructure
elements;
executing calculations on the real-time network data to compute performance
metrics;
detecting congestion incidents from the executed calculations on the real-time
network
data, wherein detecting congestion incidents comprises identifying a
persistence of one or more
metrics over a time window;
identifying one or more root causes of a set of one or more congestion
incidents from the
detected congestion incidents;
determining remediation of the identified one or more root causes of the set
of congestion
incidents, said remediation comprising upgrades to a set of network
infrastructure elements in the
plurality of network infrastructure elements; and
automatically implementing the determined remediation by programming, via a
controller,
configuration settings of the set of network infrastructure elements, said
programming of the
configuration settings comprising upgrading one or more network infrastructure
elements in the
set of network infrastructure elements.
2. The method of claim 1, wherein detecting congestion incidents from the
executed
calculations on the real-time network data further comprises:
identifying a peak usage time for a group of network infrastructure elements
by detecting
a current usage of the group exceeding a threshold that is based on
statistical analysis of historical
data; and
identifying the congestion incidents based on the identified peak usage time.
CA 3020591 2023-05-15

3. The method of claim 2, wherein detecting congestion incidents from the
executed
calculations on the real-time network data further comprises:
identifying a peak usage time for a group of network infrastructure elements
by detecting
a current usage of the group based on network activity specific to a
respective network service;
and
identifying the congestion incidents based on the identified peak usage time.
4. The method of claim 3, wherein detecting the current usage of the group
based on the
network activity specific to the respective network service comprises at least
one of: (1) detecting
Wi-Fi usage as a number of clients that accessed a Wi-Fi network, (2)
detecting RADIUS (Remote
Authentication Dial-In User Service) usage as a number of client devices that
requested
authentication from a RADIUS server, (3) detecting DHCP (Dynamic Host
Configuration
Protocol) usage as a number of client devices that requested an IP (Internet
Protocol) address from
a DHCP server, and (4) detecting Internet usage as a number of client devices
that accessed the
Internet.
5. The method of claim 1, wherein identifying the one or more root causes
of the set of
congestion incidents further comprises analyzing the detected congestion
incidents and predicting
performance issues in the network.
6. The method of claim 5, wherein predicting performance issues in the
network comprises
at least one of: (1) identifying, for Wi-Fi congestion, congested access
points in the network, (2)
identifying, for RADIUS (Remote Authentication Dial-In User Service)
congestion, servers that
are overloaded, (3) identifying, for DHCP (Dynamic Host Configuration
Protocol) congestion,
servers that are overloaded, and (4) identifying, for Web congestion, network
links that have low
bandwidth compared to usage.
7. The method of claim 1, wherein determining remediation comprises
identifying the
upgrades to the set of network infrastructure elements including at least one
of: (1) Wi-Fi access
point upgrades, (2) network services server upgrades, (3) internet link
upgrades, and (4) network
capacity upgrades including adding more network infrastructure elements.
8. The method of claim 1 further comprising performing cross-company
comparison of
congestion incidents among similar companies.
9. The method of claim 8 further comprising identifying a similar set of
companies based on
at least one of: (1) industry vertical markets, (2) a number of clients, (3) a
number of access points,
(4) types of devices in each network, (5) types of network elements in each
network, (6) density
of network deployment, (7) topology of network deployment, and (8)
geographical locations.
10. The method of claim 8 further comprising performing cross-company
comparison by
comparing a client perception of infrastructure, wherein the client perception
of infrastructure
includes a client-centric view of infrastructure factors including type,
configuration, and state.
11. The method of claim 10, wherein client perception factors for Wi-Fi
congestion comprise
at least one of: (1) AP (Access Point) load balancing, (2) AP band steering,
(3) AP DFS (Dynamic
Frequency Selection) channels, (4) percentage of 5GHz capable clients using
2.4GHz channels,
(5) percentage of 5GHz capable clients, (6) channel width, (7) controller
model, (8) controller
software, (9) Wi-Fi protocol, (10) AP models, (11) distribution of AP
neighbors on an AP, (12)
distribution of clients on the AP, (13) client perception factors for RADIUS
(Remote
Authentication Dial-In User Service) congestion comprising type of server,
server software
version, rate of transactions on the server, and load balancing across
multiple servers, (14) client
perception factors for DHCP (Dynamic Host Configuration Protocol) congestion
comprising type
of server, server software version, rate of transactions on the server, IP
(Internet Protocol) address
space size on a VLAN (Virtual Local Area Network), and load balancing across
multiple servers,
and (15) client perception factors for DNS (Domain Name Server) comprising
rate of transactions
on the server and load balancing across multiple servers.
12. The method of claim 1, wherein identifying the root causes of the set
of congestion
incidents comprises correlating a sequence of performance metrics with
composite metrics that
define relevant symptoms.
13. The method of claim 12 further comprising aggregating correlations of
symptoms across
many clients and mapping a set of symptom correlations to likely root causes.
14. The method of claim 1 further comprising identifying a particular root
cause by aggregating
root causes of a plurality of congestion incidents comprising a persistence of
one or more metrics
over a longer time period than the time window.
15. The method of claim 1, wherein the set of network infrastructure
elements includes one or
more of wireless LAN (Local Area Network) controllers, access points,
switches, routers, and
firewalls.
16. The method of claim 1 further comprising identifying outlying subgroups
within a group
of entities associated with the detected congestion incidents.
17. The method of claim 1 further comprising computing a historical
time-series data based on
detected congestion incidents over time.
18. The method of claim 17 further comprising identifying deviations in the
computed
historical time-series data.
19. The method of claim 17 further comprising identifying changes in the
computed historical
time-series data and identifying factors contributing to the identified
changes, wherein the factors
comprise at least one of: (1) configuration changes, (2) topology changes, and
(3) the upgrades to
the set of network infrastructure elements in the network.
20. The method of claim 1, wherein the real-time network data includes at
least one of: (1) data
obtained from deep packet analysis of real-time network traffic and (2) data
from the network
infrastructure elements including wireless controllers, access points, and
on-premise network systems.
21. The method of claim 1, wherein the real-time network data includes
device information,
OS (operating system) information, wireless metrics, Layer 2 through Layer 4
information,
network services related Layer 7 information, Layer 5 through Layer 7
information, and
application related data.
22. The method of claim 21, wherein the wireless metrics include at least
one of: (1) SNR
(signal to noise ratio), (2) packet loss, (3) packet retransmits, (4)
connected access points, (5)
neighboring access points information, (6) rogue access points information,
(7) outside-network
access points information, (8) interference information in RF (radio
frequency) bands, (9) network
topology information, (10) CPU (central processing unit) utilization of
wireless controllers and
access points, and (11) memory-utilization of wireless controllers and access
points.
23. The method of claim 21, wherein the Layer 2 through Layer 4 information
includes at least
one of: (1) VLAN (Virtual Local Area Network) data, (2) receive bytes or
packets, (3) transmit
bytes, (4) transmit packets, (5) DSCP (Differentiated Services Code Point)
data, (6) RTP (Real-time Transport Protocol), (7) SRTP (Secure Real-time
Transport Protocol), (8)
FTP (File Transfer
Protocol) protocol related data, (9) latency data, and (10) retransmits data.
24. The method of claim 21, wherein the network services related Layer 7
information include
at least one of: (1) DHCP (Dynamic Host Configuration Protocol), (2) DNS
(Domain Name
System), (3) response times, and (4) failure codes.
25. The method of claim 21, wherein the Layer 5 through Layer 7 information
includes at least
one of: (1) a SSL (Secure Sockets Layer) Certificate and (2) HTTP (Hypertext
Transfer Protocol)
response times.
26. The method of claim 21, wherein the real-time network data pulled from
on-premise systems includes at least one of: (1) Netflow, (2) Sflow, and (3) SNMP
(Simple Network Management Protocol) records from routers, switches, and network
elements.
27. A machine-readable medium storing a program for execution by at least
one processing
unit, the program comprising sets of instructions for implementing the method
according to any
one of claims 1 to 26.
28. An electronic device comprising:
a set of processing units; and
a machine-readable medium storing a program for execution by at least one of
the
processing units, the program comprising sets of instructions for implementing
the method
according to any one of claims 1 to 26.

Description

Note: Descriptions are shown in the official language in which they were submitted.

CA 03020591 2018-10-10
WO 2017/184627 PCT/US2017/028173
A SYSTEM AND METHOD FOR NETWORK INCIDENT IDENTIFICATION,
CONGESTION DETECTION, ANALYSIS, AND MANAGEMENT
Copyright Notice
A portion of the disclosure of this patent document contains material that
is subject to copyright protection. The copyright owner has no objection to
the
facsimile reproduction by anyone of the patent document or the patent
disclosure, as it
appears in the Patent and Trademark Office patent files or records, but
otherwise
reserves all copyright rights whatsoever.
Technical Field
The present disclosure pertains generally to the fields of identifying and
analyzing network incidents. More particularly, the present disclosure relates to a
system and method for identifying, analyzing, and remediating network incidents.
Description of the Related Art
Obtaining business-level insight and control over the applications, users
and devices in modern networks is becoming extremely challenging. On the
applications front, modern networks have a huge mix in application types and
deployment locations. For example, business applications may be hosted
off-premise in the cloud (e.g., salesforce.com), on-premise in a local data center
(e.g., SAP), or on-premise between hosts (e.g., unified communications). On the
users and devices front,
modern networks are accessed by a myriad of devices from wired desktops to
wireless
devices such as laptop computers, mobile phones, and tablet PCs.
Traditional performance monitoring or analytics tools work in silos on
individual layers of the network stack and do not analyze correlated
information across
the multiple layers of the network stack to provide a comprehensive view of the
network performance from an end-user perspective. Moreover, they do not perform
analysis across multiple disparate networks to yield better insights for any
given
network environment.
BRIEF SUMMARY
One embodiment of the present disclosure is a method for automatic
detection of a network incident. The method includes: collecting real-time
network
data; executing performance calculations on the real-time network data to
compute
performance metrics; and detecting a pattern over a time window, wherein
detecting a
pattern includes detecting a proportion of metric values crossing a threshold
exceeding
a defined percentage amount, detecting a presence of a sequence of metric
values,
detecting a time-ordered stretch of certain metric values exceeding a defined
threshold,
detecting a cyclical presence of a sequence of metric values, or combinations
thereof.
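The pattern tests named above can be illustrated with a minimal sketch. It is not the disclosed implementation: the function names, the use of "greater than" as the meaning of "crossing a threshold", and the sample values are all assumptions made for illustration, and the cyclical-presence test is left out.

```python
from typing import Sequence


def proportion_crossing(values: Sequence[float], threshold: float) -> float:
    """Fraction of metric values in the window that exceed the threshold."""
    if not values:
        return 0.0
    return sum(v > threshold for v in values) / len(values)


def longest_stretch(values: Sequence[float], threshold: float) -> int:
    """Length of the longest time-ordered run of values above the threshold."""
    best = run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        best = max(best, run)
    return best


def contains_sequence(flags: Sequence[bool], pattern: Sequence[bool]) -> bool:
    """True if `pattern` occurs as a contiguous subsequence of `flags`."""
    n, m = len(flags), len(pattern)
    if m == 0:
        return False
    return any(list(flags[i:i + m]) == list(pattern) for i in range(n - m + 1))


def detect_incident(values: Sequence[float], threshold: float,
                    min_fraction: float, min_stretch: int) -> bool:
    """Flag an incident if either the proportion test or the stretch test fires."""
    return (proportion_crossing(values, threshold) > min_fraction
            or longest_stretch(values, threshold) > min_stretch)


# Hypothetical latency samples (ms) over one time window.
latency = [10, 80, 90, 85, 20, 95, 15, 12]
incident = detect_incident(latency, threshold=50, min_fraction=0.4, min_stretch=5)
```

In this sketch the proportion test and the stretch test are combined with a logical OR, which is one reading of "or combinations thereof"; a deployment could equally require both.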
Some embodiments of the disclosed methods further include detecting a
client network incident composed of one or many clients and/or an
infrastructure
network incident composed of one or many infrastructure elements. Other
embodiments of the disclosed methods further include identifying root causes
of a
network incident. Still other embodiments of the disclosed methods further
include
identifying root causes of a network incident by correlating a sequence of
performance
metrics with other composite metrics that define relevant symptoms. Yet other
embodiments of the disclosed methods further include aggregating the
correlations of
symptoms across many clients and/or infrastructure elements, and mapping a set
of
symptom correlations to probable root causes. Another embodiment of the
disclosed
methods further includes identifying root causes of a network incident by
aggregating
root causes of network incidents over a longer time period. Still another
embodiment of
the disclosed methods further includes identifying remediation for the network
incidents.
Continuing, some embodiments of the disclosed methods further include
identifying outlying subgroups within a group of entities in the network incident.
Other
embodiments of the disclosed methods further include computing a historical
time-series data based on computed network incidents over time. Still other
embodiments of
the disclosed methods further include identifying deviations in the computed
historical
time-series data. Yet other embodiments of the disclosed methods further
include
identifying changes in the computed historical time-series data and/or
identifying
factors contributing to these changes.
Additionally, some embodiments of the disclosed methods further
include performing cross-company comparison of network incidents among similar
companies. Other embodiments of the disclosed methods further include
performing
cross-company comparison of contributing factors to the network incidents,
including
client perception of an infrastructure. Still other embodiments of the
disclosed methods
further include identifying a similar set of companies based on network
factors,
including a number of wired/wireless clients, a number of access points, types
of
devices and/or network elements in each network, a density and/or topology of
network
deployment, geographical locations, or combinations thereof. In yet other
embodiments
of the disclosed methods, the real time network data includes: data
obtained from deep
packet analysis of real time network traffic; data from network elements
including
wireless controllers, access points, or on-premise network systems; or
combinations
thereof.
Moreover, in some embodiments the real time network data includes:
device/OS information, wireless metrics, Layer 2 through Layer 4 information,
network
services related Layer 7 information, Layer 5 through Layer 7 information, and
application related data, or combinations thereof. In other embodiments, the
wireless
metrics include SNR (signal to noise ratio), packet loss/retransmits,
connected access
points, channel utilization at the access points, neighboring access points
information,
rogue/outside-network access points information, interference information in
the RF
bands, network topology information,
CPU/memory-utilization/reboots/configuration of
wireless controllers and access points, or combinations thereof. In still
other
embodiments, the Layer 2 through Layer 4 information includes VLAN (virtual
local
area network) data, receive/transmit bytes/packets, DSCP (Differentiated
Services Code
Point) data, TCP (transmission control protocol)/UDP (User Datagram
Protocol)/RTP
(Real-time Transport Protocol)/SRTP (Secure Real-time Transport Protocol)/FTP
(File
Transfer Protocol) protocol related data such as latency/retransmits, or
combinations
thereof.
In some embodiments of the disclosed methods, the network services
related Layer 7 information includes DHCP (Dynamic Host Configuration
Protocol),
DNS (Domain Name System), RADIUS (Remote Authentication Dial-In User Service)
protocol information such as response times and failure codes, or combinations
thereof.
In other embodiments of the disclosed methods, the Layer 5 through Layer 7
information includes SSL (Secure Sockets Layer) Certificate, hostnames, HTTP
(Hypertext Transfer Protocol) response times, or combinations thereof. In
still other
embodiments of the disclosed methods, the real time network data pulled from
on-premise systems includes call quality records provided by UC (Unified
Communications) systems that include client MOS (Mean Opinion Score) scores,
network latency, packet jitter, packet loss, abnormal call terminations,
failure records, or combinations thereof. In yet other embodiments, the real time
network data pulled from on-premise systems includes Netflow/Sflow/SNMP (Simple
Network Management Protocol) records from routers, switches, network elements,
or combinations thereof.
Still another embodiment of the present disclosure is directed towards
analyzing and remediating network incidents. The method includes: executing
performance calculations on the real-time network data; detecting network
incidents
from the executed performance calculations on the real-time network data,
wherein the
network incidents include client network incidents and/or infrastructure
network
incidents; identifying one or more root causes of a detected network incident
pattern;
and determining remediation of the identified one or more root causes of the
detected
network incident pattern.
One embodiment of the present disclosure is a method for automated
detection of a congestion incident in an enterprise network. The method
includes:
collecting real-time network data; executing calculations on the real-time
network data
to compute performance metrics; detecting peak usage time; and detecting one
or more
congestion incidents, wherein a congestion incident comprises a persistence of
one or
more metrics over a time window that comprises detecting a proportion of
metric values
crossing a threshold that exceeds a defined percentage amount, detecting a
time-ordered
stretch of metric values that exceeds a defined threshold, or combinations
thereof.
In some embodiments of the disclosed methods, the detecting of peak
usage time further comprises detecting current usage exceeding an absolute or
relative/percentile threshold based on statistical analysis of historical
data. Other
embodiments of the disclosed methods further include detecting usage based on
network activity specific to a respective network service. Still other
embodiments of the
disclosed methods further include detecting Wi-Fi usage as a number of clients
that
accessed the Wi-Fi network, detecting RADIUS (Remote Authentication Dial-In
User
Service) usage as a number of client devices that requested authentication
from a
RADIUS server, detecting DHCP (Dynamic Host Configuration Protocol) usage as a
number of client devices that requested an IP (Internet Protocol) address from
a DHCP
server, detecting Internet usage as a number of client devices that accessed
the Internet,
or combinations thereof. Yet other embodiments of the disclosed methods further
include performing cross-company comparison of congestion incidents among
similar
companies.
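Peak usage detection against a threshold derived from historical data, and usage counted as a number of distinct clients, might be sketched as follows. This is an illustrative assumption rather than the disclosed method: the 90th-percentile default, the function names, and the (client, service) event tuples are hypothetical.

```python
from statistics import quantiles
from typing import Iterable, Optional


def percentile_threshold(history: list[float], pct: int = 90) -> float:
    """Relative threshold: the pct-th percentile of historical usage."""
    # quantiles(..., n=100) returns the 1st through 99th percentile cut points.
    return quantiles(history, n=100, method="inclusive")[pct - 1]


def is_peak(current_usage: float, history: list[float],
            pct: int = 90, absolute: Optional[float] = None) -> bool:
    """True when current usage exceeds an absolute or percentile threshold."""
    threshold = absolute if absolute is not None else percentile_threshold(history, pct)
    return current_usage > threshold


def service_usage(events: Iterable[tuple[str, str]], service: str) -> int:
    """Usage of one service as the number of distinct clients that used it."""
    return len({client for client, svc in events if svc == service})


# Hypothetical history: client counts 1..100 observed in past intervals.
history = [float(n) for n in range(1, 101)]
events = [("aa", "wifi"), ("aa", "wifi"), ("bb", "wifi"), ("cc", "dhcp")]
wifi_clients = service_usage(events, "wifi")
```

Counting distinct clients per service keeps the usage measure specific to the service being examined, so a Wi-Fi peak and a DHCP peak can be detected independently.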
Continuing, some embodiments of the disclosed methods further include
identifying a similar set of companies based on factors such as industry
vertical
markets, a number of wired/wireless clients, a number of access points, types
of devices
and/or network elements in each network, density and/or topology of network
deployment, geographical locations, or combinations thereof. Other embodiments
of
the disclosed methods further include performing cross-company comparison by
comparing a client perception of infrastructure, wherein a client perception
of
infrastructure includes a client-centric view of infrastructure factors
including type,
configuration, and state. Still other embodiments of the disclosed methods
further
include client perception factors for Wi-Fi congestion that comprise AP
(Access Point)
load balancing, AP band steering, AP DFS (Dynamic Frequency Selection)
channels,
percentage of 5GHz capable clients using 2.4GHz channels, percentage of 5GHz
capable clients, channel width, controller model, controller software, Wi-Fi
protocol,
AP models, distribution of AP neighbors and distribution of clients on an AP,
client
perception factors for RADIUS congestion comprising rate of transactions on
the server
and load balancing across multiple servers, client perception factors for DHCP
congestion comprising rate of transactions on the server, IP address space
size on the
VLAN (Virtual Local Area Network) and load balancing across multiple servers,
client
perception factors for DNS comprising rate of transactions on the server and
load
balancing across multiple servers, or combinations thereof.
Additionally, some embodiments of the disclosed methods further
include identifying root causes of a congestion incident. Other embodiments of
the
disclosed methods further include identifying root causes of the congestion
incident by
correlating a sequence of performance metrics with composite metrics that
define
relevant symptoms. Still other embodiments of the disclosed methods further
include
aggregating the correlations of symptoms across many clients and mapping a set
of
symptom correlations to likely root causes. Yet other embodiments of the
disclosed
methods further include identifying root causes by aggregating the root causes
of
congestion incidents over a longer time period than the time window.
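Aggregating symptoms across clients and mapping them to likely root causes could look like the sketch below. The rule table, the symptom names, and the 50% share cut-off are hypothetical placeholders; the disclosure does not specify them.

```python
from collections import Counter

# Hypothetical rule table: a set of symptoms maps to a likely root cause.
RULES = {
    frozenset({"low_snr", "high_retransmits"}): "poor RF coverage",
    frozenset({"high_channel_utilization"}): "congested access point",
    frozenset({"dhcp_timeout"}): "overloaded DHCP server",
}


def likely_root_causes(client_symptoms: dict[str, set[str]],
                       min_share: float = 0.5) -> list[str]:
    """Return root causes whose symptoms are all common across the clients."""
    total = len(client_symptoms)
    if total == 0:
        return []
    # Count how many clients exhibit each symptom.
    counts = Counter(s for symptoms in client_symptoms.values() for s in symptoms)
    # Keep only symptoms seen on at least min_share of the clients.
    common = {s for s, c in counts.items() if c / total >= min_share}
    return sorted(cause for needed, cause in RULES.items() if needed <= common)


# Hypothetical per-client symptoms observed during one congestion incident.
clients = {
    "c1": {"low_snr", "high_retransmits"},
    "c2": {"low_snr", "high_retransmits", "dhcp_timeout"},
    "c3": {"low_snr"},
    "c4": {"high_retransmits"},
}
causes = likely_root_causes(clients)
```

Here a symptom seen on only one of four clients (the DHCP timeout) does not reach the share cut-off, so it does not contribute a root cause for the group.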
Moreover, some embodiments of the disclosed methods further include
identifying remediation for the congestion incidents. Other embodiments of the
disclosed methods further include identifying outlying subgroups within a
group of
entities associated with the congestion incidents. Still other embodiments of
the
disclosed methods further include computing a historical time-series data
based on
computed congestion incidents over time. Yet other embodiments of the
disclosed
methods further include identifying deviations in the computed historical
time-series data. Another embodiment of the disclosed methods further includes
identifying changes in the computed historical time-series data and/or identifying
factors
contributing to the identified changes comprising configuration changes in the
enterprise network.
Furthermore, in some embodiments of the disclosed methods, the real
time network data includes: data obtained from deep packet analysis of real
time
network traffic, data from network elements including wireless controllers,
access
points, on-premise network systems, or combinations thereof. In other
embodiments of
the disclosed methods, the real time network data includes: device/OS
information,
wireless metrics, Layer 2 through Layer 4 information, network services
related Layer 7
information, Layer 5 through Layer 7 information, and application related
data. In still
other embodiments of the disclosed methods, the wireless metrics include SNR
(signal
to noise ratio), packet loss/retransmits, connected access points, channel
utilization at
the access points, neighboring access points information, rogue/outside-network
access points information, interference information in the RF bands, network
topology information, CPU/memory-utilization/reboots/configuration of wireless
controllers and
access points, or combinations thereof. In yet other embodiments of the
disclosed
methods, the Layer 2 through Layer 4 information includes VLAN data,
receive/transmit bytes/packets, DSCP (Differentiated Services Code Point)
data, TCP
(transmission control protocol)/UDP (User Datagram Protocol)/RTP (Real-time
Transport Protocol)/SRTP (Secure Real-time Transport Protocol)/FTP (File
Transfer
Protocol) protocol related data such as latency/retransmits, or combinations
thereof.
Continuing, in some embodiments of the disclosed methods, the network
services related Layer 7 information includes DHCP, DNS
(Domain
(Domain
Name System), RADIUS protocol information such as response times and failure
codes, or combinations thereof. In other embodiments of the disclosed methods,
the
Layer 5 through Layer 7 information includes SSL (Secure Sockets Layer)
Certificate,
hostnames, HTTP (Hypertext Transfer Protocol) response times, or combinations
thereof. In still other embodiments of the disclosed methods, the real time
network data
pulled from on-premise systems includes call quality records provided by UC
(Unified
Communications) systems that include client MOS (Mean Opinion Score) scores,
network latency, packet jitter, packet loss, abnormal call terminations,
failure records,
or combinations thereof. In yet other embodiments, the real time network data
pulled from on-premise systems includes Netflow/Sflow/SNMP (Simple Network
Management Protocol) records from routers, switches, network elements, or
combinations thereof.
Another embodiment of the present disclosure is a method for automated
detection of a congestion incident in an enterprise network. The method
includes:
locally collecting real-time network data; executing calculations on the
real-time network data; determining performance metrics from the executed calculations
on the
real-time network data; detecting peak usage time; detecting one or more
congestion
incidents, wherein a congestion incident comprises a persistence of one or
more metrics
over a certain time window that comprises detecting the proportion of metric
values
crossing a threshold itself exceeding a defined percentage amount, detecting a
time-ordered stretch of certain metric values exceeding a defined threshold, or
combinations
thereof; identifying root causes of the one or more congestion incidents; and
predicting
future congestion incidents based on the one or more congestion incidents.
Another embodiment of the present disclosure is a method for network
congestion analysis and management in an enterprise network. The method
includes:
locally collecting real-time network data; executing performance calculations
on the
real-time network data; detecting client network congestion from the executed
performance calculations on the real-time network data; identifying outlier
subgroups
within the detected client network congestion; determining one or more causes
of the
identified outlier subgroups within the detected client network congestion;
and
predicting future congestion incidents based on current client network
congestion.
One embodiment of the present disclosure is directed towards a method
of extracting information from real-time network packet data to analyze
connectivity
data for client devices in a network. The method includes: detecting when
client
devices initiate a connectivity event; after detecting a connectivity event,
waiting a
period of time for the client device to reach a network connected state; after
waiting a
period of time, recording connectivity event information; and sending the
recorded
connectivity event information to an analytics system for network incident
and/or
network congestion analysis.
In some embodiments of the disclosed method, the connectivity event
comprises a connectivity event determined by looking for a Remote
Authentication
Dial-In User Service (RADIUS) request, Dynamic Host Configuration Protocol
(DHCP) discover message, or DHCP request packet. In other embodiments, a
network
connected state comprises the client device authenticated with the network,
having a
valid IP address, and being able to communicate with internal and external
computer
hosts. In still other embodiments, being able to communicate with internal and
external
hosts includes Transmission Control Protocol (TCP) or User Datagram Protocol
(UDP)
traffic transmitted either to a public IP address, or to a different subnet.
In yet other
embodiments, the connectivity event information includes: RADIUS last state,
DHCP
last state, ARP (Address Resolution Protocol) last state with respect to
default gateway,
DNS (Domain Name System) response, communication with internal or external
computer hosts, or combinations thereof.
Continuing, in some embodiments of the disclosed method, a RADIUS
last state includes failed, success, timeout, or combinations thereof. In other
embodiments, a DHCP last state includes Offer, Acknowledge, No Acknowledge,
timeout, or combinations thereof. In still other embodiments, an ARP last
state with
respect to default gateway includes response, timeout, or combinations
thereof. In yet
other embodiments, a DNS last state includes response, timeout, or
combinations
thereof.
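The connectivity event information described above could be held in a small record such as the one sketched below. The field names, the lowercase state strings, and the rule mapping "authenticated" to a RADIUS success and "valid IP address" to a DHCP acknowledge are illustrative assumptions; the disclosure does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Packet types the text names as markers of a connectivity event.
CONNECTIVITY_TRIGGERS = {"radius_request", "dhcp_discover", "dhcp_request"}

# Maps an observed protocol to the record field holding its last state.
_LAST_STATE_FIELD = {
    "radius": "radius_last_state",
    "dhcp": "dhcp_last_state",
    "arp": "arp_gateway_last_state",
    "dns": "dns_last_state",
}


@dataclass
class ConnectivityRecord:
    client: str
    radius_last_state: Optional[str] = None       # failed, success, timeout
    dhcp_last_state: Optional[str] = None         # offer, acknowledge, no_acknowledge, timeout
    arp_gateway_last_state: Optional[str] = None  # response, timeout
    dns_last_state: Optional[str] = None          # response, timeout
    reached_hosts: bool = False                   # TCP/UDP traffic to another subnet or public IP


def build_record(client: str, events: Iterable[Tuple[str, str]]) -> ConnectivityRecord:
    """Fold time-ordered (protocol, state) observations into last states."""
    record = ConnectivityRecord(client)
    for protocol, state in events:
        if protocol == "traffic":
            record.reached_hosts = True
        elif protocol in _LAST_STATE_FIELD:
            setattr(record, _LAST_STATE_FIELD[protocol], state)
    return record


def is_network_connected(record: ConnectivityRecord) -> bool:
    """Assumed rule: authenticated, holds an address, and can reach hosts."""
    return (record.radius_last_state == "success"
            and record.dhcp_last_state == "acknowledge"
            and record.reached_hosts)


observed = [("radius", "success"), ("dhcp", "offer"), ("dhcp", "acknowledge"),
            ("arp", "response"), ("dns", "response"), ("traffic", "external")]
record = build_record("client-1", observed)
```

A record built after the waiting period keeps only the last state per protocol, which is the information the method sends on to the analytics system.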
Another embodiment of the present disclosure is directed towards
extracting information from real-time network data to analyze roaming data for
client
devices in a network. The method includes: detecting when client devices
initiate an
association event; buffering data regarding the detected association events;
analyzing
the buffered data and the association events; and sending the roaming event
information
to an analytics system for network incident and/or network congestion
analysis.
Additionally, some embodiments of the disclosed method further include
detecting an association event as an on-the-wire association event by looking
for a
Remote Authentication Dial-In User Service (RADIUS) request with metadata
indicating an association with a particular access point, Service Set
Identifier (SSID), or
combination thereof. In other embodiments of the disclosed method, the
buffering of
data comprises buffering TCP (Transmission Control Protocol) traffic before
and after
the association event, TCP flow related information such as retransmitted
packets, RST
(Reset) and SYN (Synchronize) packets. In still other embodiments of the
disclosed
method, analyzing the buffered data comprises determining, if there was
traffic before
the event, the time span between the last traffic packet before the
association event and
the first traffic packet after the association event. In yet other embodiments
of the
disclosed method, analyzing the buffered data comprises determining, if there
was TCP
traffic before the event, the presence of a high percentage of TCP
retransmitted packets.
In another embodiment of the disclosed method, analyzing the buffered data
comprises
determining, if there was TCP traffic before the event, the presence of a high
percentage
of TCP reset and synchronization packets.
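The buffered-data analysis for roaming can be illustrated with the sketch below. The `BufferedPacket` fields and the function names are hypothetical, and what counts as a "high" percentage is left to the caller because the text does not give a value.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BufferedPacket:
    timestamp: float          # seconds
    is_tcp: bool = True
    is_retransmit: bool = False
    is_rst: bool = False
    is_syn: bool = False


def roaming_gap(packets: Sequence[BufferedPacket], event_time: float) -> Optional[float]:
    """Time span between the last packet before and the first packet after the event."""
    before = [p.timestamp for p in packets if p.timestamp < event_time]
    after = [p.timestamp for p in packets if p.timestamp > event_time]
    if not before or not after:
        return None  # no traffic on one side of the association event
    return min(after) - max(before)


def tcp_fraction(packets: Sequence[BufferedPacket], event_time: float, flag: str) -> Optional[float]:
    """Fraction of buffered TCP packets with the given flag set.

    Returns None when there was no TCP traffic before the event, mirroring
    the "if there was TCP traffic before the event" condition in the text.
    """
    tcp = [p for p in packets if p.is_tcp]
    if not any(p.timestamp < event_time for p in tcp):
        return None
    return sum(getattr(p, flag) for p in tcp) / len(tcp)


buffered = [
    BufferedPacket(1.0),
    BufferedPacket(2.0),
    BufferedPacket(4.0, is_retransmit=True),
    BufferedPacket(4.2, is_syn=True),
]
gap = roaming_gap(buffered, event_time=2.5)
retransmit_share = tcp_fraction(buffered, 2.5, "is_retransmit")
```

A long gap or a large retransmit share around the association event would then be reported as part of the roaming event information.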
Moreover, some embodiments of the disclosed method further include
analyzing roaming data for client devices in a network without on-the-wire
association
events by computing roaming event information based on short, sliding window
periods
of time. In other embodiments of the disclosed method, the real-time network
data
includes: data obtained from deep packet analysis of real-time network
traffic, data
from network elements including wireless controllers, access points, on-
premise
network systems, or combinations thereof. In still other embodiments of the
disclosed
method, the real-time network data includes: device/OS (Operating System)
information, wireless metrics, Layer 2 through Layer 4 information, network
services
related Layer 7 information, Layer 5 through Layer 7 information, and
application
related data. In yet other embodiments of the disclosed method, the wireless
metrics
include SNR (signal to noise ratio), packet loss/retransmits, connected access
points,
channel utilization at the access points, neighboring access points
information,
rogue/outside-network access points information, interference information in
the RF
(Radio Frequency) bands, network topology information, CPU (Central Processing
Unit)/memory-utilization/reboots/configuration of wireless controllers and
access
points, or combinations thereof.
In another embodiment of the disclosed method, the Layer 2 through
Layer 4 information includes VLAN (Virtual Local Area Network) data,
receive/transmit bytes/packets, DSCP (Differentiated Services Code Point)
data, TCP
(Transmission Control Protocol)/UDP (User Datagram Protocol)/RTP (Real-time
Transport Protocol)/SRTP (Secure Real-time Transport Protocol)/FTP (File Transfer
Protocol) protocol related data such as latency/retransmits, or combinations
thereof. In
other embodiments of the disclosed method, the network services related Layer
7
information includes DHCP (Dynamic Host Configuration Protocol), DNS (Domain
Name System), RADIUS (Remote Authentication Dial-In User Service) protocol
information such as response times and failure codes, or combinations thereof.
In still
other embodiments of the disclosed method, the Layer 5 through Layer 7
information
includes SSL (Secure Sockets Layer) Certificate, hostnames, HTTP (Hypertext
Transfer
Protocol) response times, or combinations thereof. In yet other embodiments of
the
disclosed method, the real-time network data pulled from on-premise systems
includes
call quality records provided by UC (Unified Communications) systems that
include
client MOS (Mean Opinion Score) scores, network latency, packet jitter, packet
loss,
abnormal call terminations, failure records, or combinations thereof; and real-time
network data pulled from on-premise systems including NetFlow/sFlow/SNMP (Simple
Network Management Protocol) records from routers, switches, network elements,
or combinations thereof.
Another embodiment of the present disclosure is also directed towards a
method of extracting information from real-time network packet data to analyze
data for
devices in a network. The method includes: detecting when client devices
initiate a
connectivity event; after detecting a connectivity event, waiting a period of
time for the
client device to accomplish a web event; recording connectivity event
information; and
sending the recorded connectivity event information that was packaged to an
analytics
system.
In some embodiments of the disclosed method, the real-time network
data is pulled from other on-premise systems including call quality records provided
by
Unified Communications systems that include client MOS scores, jitter, packet
loss,
abnormal call terminations, or combinations thereof. In other embodiments of
the
disclosed method, the real-time network data is pulled from other on-premise
systems
including CPU/Memory utilization information from DHCP (Dynamic Host
Configuration Protocol), DNS (Domain Name System), RADIUS servers, other
network elements, or combinations thereof. In still other embodiments of the
disclosed
method, the real-time network data is pulled from other on-premise systems
including:
NetFlow/sFlow/SNMP (Simple Network Management Protocol) records from routers,
switches, or combinations thereof.
Still another embodiment of the present disclosure is also directed
towards a method of extracting information from real-time network packet data
to
analyze data for devices in a network. This method includes: detecting when
client
devices initiate a connectivity event; after detecting a connectivity event,
waiting a
period of time for the client device to accomplish a web event; recording
connectivity
event information; and sending the recorded connectivity event information
that was
packaged to an analytics system.
The disclosed embodiments further relate to machine readable media on
which are stored embodiments of the disclosed invention described
herein. It is
contemplated that any media suitable for retrieving instructions is within the
scope of
the disclosed embodiments. By way of example, such media may take the form of
magnetic, optical, or semiconductor media. The disclosed embodiments also
relate to
data structures that contain embodiments of the disclosed invention, and to
the
transmission of data structures containing embodiments of the disclosed
invention.
Further aspects of the disclosed embodiments will be brought out in the
following portions of the specification, wherein the detailed description is
for the
purpose of fully disclosing the various embodiments without placing
limitations
thereon.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING
The present application will be more fully understood by reference to the
following figures, which are for illustrative purposes only. The figures are
not
necessarily drawn to scale and elements of similar structures or functions are
generally
represented by like reference numerals for illustrative purposes throughout
the figures.
The figures are only intended to facilitate the description of the various
embodiments
described herein. The figures do not describe every aspect of the teachings
disclosed
herein and do not limit the scope of the claims.
FIG. 1 illustrates system architecture of an exemplary network incident
identification, analysis, and remediation system deployed in an enterprise
network,
according to one embodiment;
FIG. 2 illustrates a functional diagram of a network, according to one
embodiment;
FIG. 3A illustrates system architecture of an exemplary system deployed
in an enterprise network, according to one embodiment;
FIG. 3B illustrates system architecture of an exemplary system deployed
in an enterprise network, according to another embodiment;
FIG. 4 is a block diagram of an out-of-band deployment, according to
one embodiment;
FIG. 5 is a block diagram of an inline deployment, according to one
embodiment;
FIG. 6 is a flow diagram for providing network visibility, according to
one embodiment;
FIG. 7 is a flow diagram of an input collection process at the collector,
according to one embodiment;
FIG. 8 illustrates a diagram of an exemplary network, according to one
embodiment;
FIG. 9 illustrates a diagram of an exemplary legacy network including
a switch, according to one embodiment;
FIG. 10 is a flow diagram of an exemplary information collection
process, according to one embodiment;
FIG. 11 is a flow diagram of summarization and indexing processes,
according to one embodiment;
FIG. 12 is a flow diagram of a control loop, according to one
embodiment;
FIG. 13 illustrates exemplary computer architecture, according to one
embodiment;
FIG. 14 illustrates a logic flow diagram of a client network incident
identification, analysis, and remediation system and method for cross-company
real-
time network data analysis;
FIG. 15 illustrates a more detailed logic flow diagram of a client network
incident identification, analysis, and remediation system and method for cross-
company
real-time network data analysis;
FIG. 16 illustrates a logic flow diagram of a group network incident
identification, analysis, and remediation system and method for cross-company
real-time network data analysis;
FIG. 17A illustrates a computer screenshot displaying a client network
incident for a wired network device;
FIG. 17B illustrates a computer screenshot displaying a client network
incident for a wireless network device;
FIG. 17C illustrates a computer screenshot displaying an infrastructure
network incident for a wired network device;
FIG. 17D illustrates a computer screenshot displaying a group network
incident for wireless network devices, as well as an Outlying Subgroups
analysis;
FIG. 17E illustrates a computer screenshot displaying an infrastructure
network incident for a wireless network device, as well as a root causes and
symptoms
analysis;
FIG. 18 illustrates a logic flow diagram of client congestion metrics for
cross-company real-time network data analysis; and
FIG. 19 illustrates a logic flow diagram of an information extraction
system and method for real-time network packet data analysis for connectivity
detection
and roaming issue detection.
DETAILED DESCRIPTION
Persons of ordinary skill in the art will understand that the present
disclosure is illustrative only and not in any way limiting. Other embodiments
of the
presently disclosed system and method readily suggest themselves to such
skilled
persons having the assistance of this disclosure.
Each of the features and teachings disclosed herein can be utilized
separately or in conjunction with other features and teachings to provide a
system and
method for network incident identification and analysis. Representative
examples
utilizing many of these additional features and teachings, both separately and
in
combination, are described in further detail with reference to the attached
figures. This
detailed description is merely intended to teach a person of skill in the art
further details
for practicing aspects of the present teachings and is not intended to limit
the scope of
the claims. Therefore, combinations of features disclosed above in the
detailed
description may not be necessary to practice the teachings in the broadest
sense, and are
instead taught merely to describe particularly representative examples of the
present
teachings.
In the description below, for purposes of explanation only, specific
nomenclature is set forth to provide a thorough understanding of the present
system and
method. However, it will be apparent to one skilled in the art that these
specific details
are not required to practice the teachings of the present system and method.
Some portions of the detailed descriptions herein are presented in terms
of algorithms and symbolic representations of operations on data bits within a
computer
memory. These algorithmic descriptions and representations are the means used
by
those skilled in the data processing arts to most effectively convey the
substance of their
work to others skilled in the art. An algorithm is here, and generally,
conceived to be a
self-consistent sequence of steps leading to a desired result. The steps are
those
requiring physical manipulations of physical quantities. Usually, though not
necessarily, these quantities take the form of electrical or magnetic signals
capable of
being stored, transferred, combined, compared, and otherwise manipulated. It
has
proven convenient at times, principally for reasons of common usage, to refer
to these
signals as bits, values, elements, symbols, characters, terms, numbers, or the
like.
It should be borne in mind, however, that all of these and similar terms
are to be associated with the appropriate physical quantities and are merely
convenient
labels applied to these quantities. Unless specifically stated otherwise as
apparent from
the below discussion, it is appreciated that throughout the description,
discussions
utilizing terms such as "processing," "computing," "calculating,"
"determining,"
"displaying," "configuring," or the like, refer to the actions and processes
of a computer
system, or similar electronic computing device, that manipulates and
transforms data
represented as physical (electronic) quantities within the computer system's
registers
and memories into other data similarly represented as physical quantities
within the
computer system memories or registers or other such information storage,
transmission
or display devices.
The present application also relates to an apparatus for performing the
operations herein. This apparatus may be specially constructed for the
required
purposes, or it may comprise a general purpose computer selectively activated
or
reconfigured by a computer program stored in the computer. Such a computer
program
may be stored in a computer readable storage medium, such as, but not limited
to, any
type of disk, including floppy disks, optical disks, CD-ROMs, and magnetic-
optical
disks, read-only memories (ROMs), random access memories (RAMs), EPROMs,
EEPROMs, magnetic or optical cards, or any type of media suitable for storing
electronic instructions, and each coupled to a computer system bus.
The algorithms presented herein are not inherently related to any
particular computer or other apparatus. Various general purpose systems,
computer
servers, or personal computers may be used with programs in accordance with
the
teachings herein, or it may prove convenient to construct a more specialized
apparatus
to perform the required method steps. The required structure for a variety of
these
systems will appear from the description below. It will be appreciated that a
variety of
programming languages may be used to implement the teachings of the disclosure
as
described herein.
Moreover, the various features of the representative examples and the
dependent claims may be combined in ways that are not specifically and
explicitly
enumerated in order to provide additional useful embodiments of the present
teachings.
It is also expressly noted that all value ranges or indications of groups of
entities
disclose every possible intermediate value or intermediate entity for the
purpose of
original disclosure, as well as for the purpose of restricting the claimed
subject matter.
It is also expressly noted that the dimensions and the shapes of the
components shown
in the figures are designed to help to understand how the present teachings
are
practiced, but not intended to limit the dimensions and the shapes shown in
the
examples.
The disclosed system and method is directed towards the identification
and analysis of network incidents. One such embodiment of a disclosed method
includes: executing performance calculations on the real-time network data,
detecting
network incidents from the executed performance calculations on the real-time
network
data, performing pattern detection of the detected network incidents,
identifying one or
more root causes of a detected network incident pattern, and determining
remediation of
the identified one or more root causes of the detected network incident
pattern.
Within the context of the description of network incidents in the present
application, the phrase "network incident" is defined as the presence of a
"pattern" over
the computed values of one or more performance metrics for a certain time
window. A
pattern is defined as one of the following: (a) the proportion of metric values that
cross a threshold exceeding a defined percentage amount, (b) the presence of a certain
sequence of metric values, (c) a time-ordered stretch of certain metric values with the
length of the stretch exceeding a defined threshold, or (d) the presence of a cyclical
sequence of metric values. Network incidents are further classified herein either as a
either as a
Client network incident or as an Infrastructure network incident. A client
network
incident is a network incident where the computed metrics relate to the
calculations on
data that relates to a user/client in the network. An infrastructure network
incident is a
network incident where the computed metrics relate to the calculations on data
that
relates to an infrastructure element in the network.
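As an illustration of the sequence-based patterns in the definition above (a certain sequence of metric values, and a cyclical sequence of metric values), the following sketch operates on metric values that have already been discretized into labels. The discretization, the function names, and the rule requiring at least two full repetitions are assumptions made for the example.

```python
from typing import Hashable, Optional, Sequence


def contains_sequence(values: Sequence[Hashable], pattern: Sequence[Hashable]) -> bool:
    """True if `pattern` occurs as a contiguous run inside `values`."""
    n, m = len(values), len(pattern)
    if m == 0 or m > n:
        return False
    return any(list(values[i:i + m]) == list(pattern) for i in range(n - m + 1))


def cycle_period(values: Sequence[Hashable], min_repeats: int = 2) -> Optional[int]:
    """Smallest period with which the whole window repeats, or None.

    Constant windows are ignored here; they correspond to a stretch
    rather than a cycle.
    """
    n = len(values)
    for period in range(2, n // min_repeats + 1):
        if len(set(values[:period])) < 2:
            continue
        if all(values[i] == values[i - period] for i in range(period, n)):
            return period
    return None


# Hypothetical per-interval labels for one metric over a time window.
labels = ["low", "high", "high", "low", "high", "high", "low", "high", "high"]
period = cycle_period(labels)
```

In this window the labels repeat every three intervals, so `cycle_period` reports a period of 3.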
Within the context of the present application, the term "real-time" is
defined as collecting network data related to a wired or wireless network
flow/transaction and performing calculations on the network data within a
short
duration (e.g., seconds or minutes) of the occurrence of the network
flow/transaction.
Continuing, within the context of the present application, the phrase
"client perception of the infrastructure" is defined as a client-centric view
of
infrastructure factors including type, configuration, and state. For type, an
example
would be "60 percent of clients are connected to wireless controllers that are
model X"
instead of "60 percent of the wireless controllers are of model X." For
configuration,
an example would be "55 percent of clients are connected to a DHCP server with
IP
conflict detection enabled." Finally, an example of state would be "70 percent
of
clients are connected to an access point with greater than 100 simultaneously
associated
clients."
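The difference between the client-centric view and the infrastructure-centric view can be shown with a small calculation. The controller names, model labels, and client counts below are invented for the example.

```python
from collections import Counter
from typing import Dict


def client_perception(client_to_element: Dict[str, str],
                      element_attribute: Dict[str, str]) -> Dict[str, float]:
    """Percentage of clients attached to elements with each attribute value."""
    counts = Counter(element_attribute[e] for e in client_to_element.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: 100.0 * n / total for value, n in counts.items()}


def infrastructure_view(element_attribute: Dict[str, str]) -> Dict[str, float]:
    """Percentage of elements having each attribute value."""
    counts = Counter(element_attribute.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: 100.0 * n / total for value, n in counts.items()}


# Hypothetical inventory: one controller of model X, two of model Y.
controller_model = {"wc1": "X", "wc2": "Y", "wc3": "Y"}
# Hypothetical associations: three clients on wc1, one each on wc2 and wc3.
client_controller = {"c1": "wc1", "c2": "wc1", "c3": "wc1", "c4": "wc2", "c5": "wc3"}

by_client = client_perception(client_controller, controller_model)
by_element = infrastructure_view(controller_model)
```

With these inputs, 60 percent of clients are connected to controllers of model X even though only one in three controllers is of that model.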
As shown in FIG. 1, a functional diagram of an enterprise network is
illustrated, according to one embodiment. The network includes switches,
wireless
controllers, routers, and a firewall. All of these elements can be physical or
virtual, and
the wireless controller can be logical or physical, located either on-premise
in its own
physical appliance, within an AP, or in the cloud.
Referring now to FIG. 2, a functional diagram of a network is illustrated,
according to one embodiment. The network 100 includes applications 110 and
network
elements 120 that are linked via one or more controllers 115A-115N. The "p"
and "v"
prefixes on the network elements 120 indicate physical and virtual network
elements,
respectively.
The network elements 120 include a physical switch (pSwitch) 121, a
physical router (pRouter) 122, a physical firewall (pFirewall) 123, a virtual
switch
(vSwitch) 124, a virtual firewall (vFirewall) 125, and a physical network
packet broker
126. It is appreciated that the network elements 120 can include any number of
physical switches 121, physical routers 122, physical firewalls 123, virtual
switches
124, virtual firewalls 125, and physical network packet brokers 126, and other
physical
or virtual network elements, without deviating from the present disclosure.
Network functions virtualization (NFV) refers to the implementation and
deployment of software-based network elements. Such software-based network
elements typically run on generic processing hardware (e.g., x86 machines) as
opposed
to non-NFV network elements that require dedicated hardware (e.g.,
Application-Specific Integrated Circuits (ASICs)). Examples of NFV-type network elements
include, but are not limited to, a virtual switch 124 and a virtual firewall
125. It is
appreciated that other types of NFV-type network elements may be implemented
without deviating from the present disclosure. Such NFV-type network elements
may
be run as a virtual machine on top of a hypervisor that runs on commodity
hardware.
The present system and method provides monitoring and controlling of NFV
and/or
non-virtualized network elements.
An application delivery network (ADN) encapsulates several
technologies that provide application-layer functionality in the network. A
next
generation application firewall, for example, is an appliance that provides
inline access
control functionality as a function of L4-L7 header information as well as
application,
user and content layer metadata. This appliance can perform inline deep packet
inspection to identify applications in real-time and perform access control.
The control embodiments of the present system and method provide
capabilities of controlling basic network elements such as switches, routers, and
wireless controllers, for the purpose of programming policies, including
security-related and/or quality-related policies. The present system and method can
reduce hardware and
distributed
functionality.
The network packet broker 126 (or a matrix switch) gathers, aggregates
and filters network traffic from port mirrors, network TAPs, and probes. The
network
packet broker 126 serves the filtered network traffic to network security and
performance tools in accordance with the constraints of those tools. For
example, a
network security and performance tool may only support 1 GBps of traffic, and
a
network packet broker 126 can be manually configured to filter and shape
traffic from a
GBps link to conform to the constraint of the network security and performance
tool.
Typically the network packet broker 126 is decoupled from the network security
and
performance tools to which it delivers the packets.
A portion of the present system and method performs as a network
security and performance tool. In one embodiment, the present system and
method
intelligently and dynamically programs a network packet broker 126 to gain
access to
the traffic it needs. The present system and method also summarizes and
indexes
higher-layer information about users, applications, devices, behaviors, and
the like (e.g.,
via machine learning), and enables the higher-layer information to be
queried using a
natural language processing technique. According to one embodiment, the
present
system and method is deployed in a cloud to enable advanced analytics and
cross-
network learning. "Cloud" herein refers to a computer and storage platform
service
hosted over a wide area network (e.g., the Internet). It is noted that both
ADN and
network security/performance monitoring tools are typically deployed on-
premise.
The present system and method observes and controls a programmable
network via higher-layer attributes and addresses the drawbacks of prior
systems for
monitoring and controlling networks. The discussion is divided into three
sections: (1)
architecture, (2) visibility, and (3) control.
Architecture
FIG. 3A illustrates system architecture of an exemplary system deployed
in an enterprise network, according to one embodiment. The system 200 includes
a
manager 201 (or Loupe Manager) and one or more collectors 202 (herein referred
to as
Loupe Collectors). In one embodiment, the collector 202 is a software
appliance
(virtual or physical) that is located on-premise. The collector 202 may be
deployed as a
single software element or, for scaling, as a cluster of several software
elements. For
example, the collector 202 is logic in a non-transitory computer readable
memory that
can be executed by a processor to perform the actions described herein. In
other
embodiments, the collector 202 is a combination of hardware and software.
According to some embodiments, there are multiple collectors 202 per
enterprise network 210 (e.g., a campus, a data center) and multiple networks
210 and
collectors 202 per customer. Moreover, the collectors 202 can be deployed
behind
firewalls within an enterprise network 210. This enables the collectors to
communicate easily with on-premise enterprise systems and, from behind the
firewall, to communicate outbound with off-premise systems.
The collector 202 receives live packets captured directly from physical
and/or virtual network elements 216. The collector 202 also receives data
(e.g.,
topology, statistics, user information, and the like) from other enterprise
systems
including identity management systems (e.g., active directory 217), network
element
controllers (e.g., controllers 215, network management systems), and the like.
The
collector 202 also runs performance tests against on/off-premise applications
in the
public cloud/Internet 250 (e.g., BOX®, MICROSOFT OFFICE365®,
GOOGLE®, WEBEX®, WORKDAY®, SALESFORCE®) and
collects the performance results.
The collector 202 captures all of these data, extracts key metadata or
features, and compresses and sends the key metadata or features to the manager
201
that is located in a public cloud 220. For example, the collector 202 receives 10s or
100s of gigabits per second of data, but only sends 10s or 100s of kilobits per second
of data to the manager 201. The collector 202 is provisioned and configured by the
manager 201; thus, commands from the manager 201 towards systems that are
on-premise can be proxied via the collector 202. In one embodiment, the manager 201
may also be deployed in a private cloud or elsewhere within a large multi-site
organization.
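The reduction from raw traffic to compact metadata can be pictured with the sketch below, which aggregates packets into per-flow counters and compresses the result before upload. The packet dictionary layout, the choice of JSON and zlib, and the function names are assumptions for illustration; the disclosure does not specify an encoding.

```python
import json
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def summarize_packets(packets: Iterable[Dict]) -> List[Dict]:
    """Aggregate raw packets into per-flow packet and byte counts."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["proto"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["length"]
    return [{"src": s, "dst": d, "proto": p, **stats}
            for (s, d, p), stats in flows.items()]


def encode_for_manager(summary: List[Dict]) -> bytes:
    """Serialize and compress the summary before sending it upstream."""
    return zlib.compress(json.dumps(summary, sort_keys=True).encode("utf-8"))


def decode_at_manager(blob: bytes) -> List[Dict]:
    """Inverse of encode_for_manager, as the receiving side might run it."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# One thousand hypothetical packets belonging to a single flow.
raw = [{"src": "10.0.0.5", "dst": "10.0.1.9", "proto": "tcp", "length": 1500}
       for _ in range(1000)]
summary = summarize_packets(raw)
blob = encode_for_manager(summary)
```

The thousand packets collapse into one flow record, which is the kind of size reduction that lets the collector forward only a small fraction of what it observes.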
The manager 201 summarizes and stores the data received from the
collector 202 in a database 205. The manager 201 performs additional data
collection
from off-premise enterprise systems and other applications over the public
cloud/Internet 250 and runs its own performance test. The manager 201 applies
learning and other heuristic algorithms on the data and binds higher-layer
information
(e.g., about users, applications, devices, and behaviors) to the data. The
manager 201
also computes the crawling schedule for the collectors 202 to receive data
from
different parts of the network. The manager 201 is also responsible for
providing an
Internet interface and a natural language query capability to retrieve ranked
answers
based on the learned data. Similar to the collector 202, the manager 201 is a
software
appliance that can be deployed in a cluster or in multiple tiers. The manager
201
contains a database 205 that can support large data storage and efficient
queries (e.g.,
BigTable®). Generally, there can be one manager 201 for many organizations
and/or enterprises (e.g., multi-tenant style deployment), or multiple managers
201 for
multiple organizations and/or enterprises. The manager 201 may also be logic
in a non-
transitory computer readable memory that can be executed by a processor to
perform
the actions described herein or a combination of hardware and software.
FIG. 3B illustrates system architecture of an exemplary system deployed
in an enterprise network, according to another embodiment. The system 250 includes a
manager 251 on a server 255, a collector 252, and a wireless controller 265 that
controls one or more wireless access points (APs) 256. The wireless controller 265 may take
many
forms, for example, (i) a separate on-premise software running on its own
hardware, (ii)
software that is integrated into the access points 256, or (iii) software
located off-
premise (e.g., in a cloud 220). The wireless controller 265 controls and/or
configures
the access points 256 and terminates data plane traffic coming from mobile
devices that
are wirelessly connected to the access points 256. The wireless controller 265
is an
example of a network element, as well as a controller that controls several
other
network elements (e.g., access points 256).
The collector 252 collects wireless metrics from the controller 265 via a
management interface (e.g., simple network management protocol (SNMP),
command-line interface (CLI), proprietary management protocol). Examples of these
metrics for
a mobile device include, but are not limited to: signal strengths, layer 2
traffic statistics
(e.g., packets transmitted, retried, dropped), traffic transmission rates,
device location,
and user information. Examples of these metrics for an access point include,
but are not
limited to: channel utilization, aggregated layer 2 traffic statistics,
interference
measurements, CPU/memory utilization.
The collector 252 simultaneously collects metrics and other information
from other enterprise systems where available, via their respective management
interfaces. One example is collecting user role as well as user-to-IP address
information from a directory server (e.g., LDAP, Active Directory). Another
example
is collecting unified communication performance metrics from a Microsoft Lync
Server.
The collector 252 simultaneously sees network traffic via a mirrored
interface, using a logical or physical port mirror off of the wireless controller
265, or a
logical or physical port mirror off of another network element (e.g., switch,
router,
access point) in the network where relevant user traffic is carried.
From the traffic, the collector 252 performs deep packet inspection (DPI)
and extracts, in addition to general protocol level metadata, user/device
quality of
experience (QoE) related metadata, differing on an application-by-application
basis.
For example, Internet browsing QoE metrics include page load times and/or HTTP
URL response times. Voice and video application QoE metrics involve extracting
and/or computing the relevant mean opinion score (MOS) values.
According to some embodiments, the present system and method time
aligns the QoE metadata with metadata extracted across the application stack
including
the wireless layer metrics from the wireless controller 265. For example, at a
particular
time interval, a user/device may have poor page load times, high transmission
control
protocol (TCP) retransmits, low signal-to-noise ratio (SNR), and high AP channel
utilization. The present system and method collects and stores this time
series data, and
analyzes the time series data for trends/patterns over time and other
dimensions (e.g.,
device type, location). For example, the present system and method finds that
ANDROID® devices suffer consistently worse Internet performance than
IOS® devices.
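By way of a non-limiting illustration, the time alignment described above may
be sketched as bucketing samples from different layers into a common interval
per device. The function name time_align, the 60-second interval, and the
metric names are assumptions made for this sketch and are not part of the
disclosure.

```python
from collections import defaultdict

def time_align(samples, interval_s=60):
    """Bucket (timestamp, device, metric, value) samples by device and interval.

    interval_s is an assumed bucket width; the disclosure does not fix one.
    """
    aligned = defaultdict(dict)
    for ts, device, metric, value in samples:
        bucket_start = int(ts // interval_s) * interval_s
        # A later sample of the same metric in a bucket overwrites the earlier one.
        aligned[(device, bucket_start)][metric] = value
    return dict(aligned)

# Hypothetical samples from different layers of the stack for one device.
samples = [
    (960, "device-1", "page_load_s", 8.2),      # application layer (from DPI)
    (975, "device-1", "tcp_retransmits", 41),   # transport layer
    (990, "device-1", "snr_db", 9),             # wireless layer (from controller)
    (1025, "device-1", "page_load_s", 1.1),     # falls into the next interval
]
aligned = time_align(samples)
```

Each resulting record holds the metrics of one device for one interval side by
side, which is the form needed to look for trends across layers.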
According to some embodiments, the present system and method
analyzes for trends/patterns across networks. For example, the present system
and
method identifies the specific network/protocol/wireless metrics that determine
the
application performance. As an example, the present system and method analyzes
a
bad Microsoft Lync® voice application performance (e.g., mean opinion
score
(MOS)) across many customer networks. The present system and method learns
that
the most important indicator is high levels of layer 2 packet retransmissions.
Based on
this assessment, the present system and method predicts, for a new customer
network
that has high levels of layer 2 packet retransmissions, that Microsoft
Lync®
performance would be poor unless the packet retransmissions problem is
rectified.
The present system and method has applicability to two use cases:
visibility and control. From an architecture perspective, the deployment
possibilities differ between the two use cases. In particular, for passive
visibility
only, the present system and method can be deployed out-of-band. FIG. 4 is a
block
diagram of an out-of-band deployment, according to one embodiment. A
programmable switch 324 receives mirrored traffic of network elements 316 and
communicates with a server 350 including a collector 302 and a controller 330.
The
mirrored traffic (as indicated in dotted lines) from the network elements 316
is forced
through the programmable switch 324. The programmable switch 324 can be
dynamically controlled and programmed to direct specific traffic during
specific time
intervals and network locations to the collector 302. For example, the
controller 330
controls the programmable switch 324. In a case where the total bandwidth of
the
traffic being monitored is less than the bandwidth of the collector 302, the
programmable switch 324 may not be necessary and all mirrored traffic can be
directly
sent to the collector 302. An example of this case is where only the wide area
network
(WAN) links within an enterprise network are monitored.
For control, the present system and method employs an inline
deployment, according to some embodiments. In this case, a subset of the
network
elements carrying regular traffic (e.g., non-mirrored traffic) is
programmable.
Moreover, these network elements (e.g., physical and virtual switches,
wireless access
points) may be located such that the policy can be effective, for example, to
form a
physical or logical choke point. FIG. 5 is a block diagram of an inline
deployment,
according to one embodiment. A manager 401 receives traffic from non-
programmable
network elements 416 and programmable network element 417 and communicates
with
a server 450 including a collector 402 and a controller 430. In this
embodiment, the
manager 401 is deployed on-premise in a private cloud 410, but it is apparent
that the
manager 401 can be deployed off-premise in a public cloud as illustrated in
FIGS. 3A
and 3B.
The manager 401 located in a cloud is capable of observing across
multiple customer networks. While the manager 401 (whether it is a multi-
tenant
manager or a separate manager per customer) may be deployed in a private or
public
cloud to preclude sharing of data across multiple networks, the present system
and
method may achieve overall performance improvement by combining trained
algorithms from each of the customer networks.
Visibility
The present system and method provides crawling and indexing of the
network and enables natural language queries about the network and its
applications, users,
devices and behaviors. The specific flow for network visibility is in the
following
order:
RAW DATA -> CRAWLING -> FEATURES EXTRACTION ->
SUMMARIZATION -> INDEXING -> ADVANCED ANALYTICS AND
CROSS-NETWORK-LEARNING -> QUERY-ABILITY
FIG. 6 is a flow diagram for providing network visibility, according to
one embodiment. The RAW DATA -> CRAWLING -> FEATURES EXTRACTION
pieces occur on-premise (e.g., collectors of Company 1 (501A), a branch of
Company 1
(501B), Company 2 (502)), and the SUMMARIZATION -> INDEXING ->
ADVANCED ANALYTICS AND CROSS-NETWORK-LEARNING -> QUERY-ABILITY
pieces occur in a cloud 550 (e.g., managers of Company 1 (511) and
Company 2 (512)). It is noted that these functionalities may be split across a
manager
and a collector in a variety of ways without deviating from the scope of the
present
disclosure. For example, partial summarization can occur in the collector as
opposed to
the manager. Furthermore, the manager's functionalities can itself be broken
apart into
several pieces and implemented on a distributed computing system consisting of
multiple different components. An example is the case where the data-
summarization
and data-analytics are performed on two separate multi-machine systems that
communicate with each other.
Raw Data
The raw data includes data that can be collected or crawled by a
collector or a manager. The first piece of raw data that is crawled is live
traffic on the
network that can be accessed by one or more collectors. The raw data can
further
include statistical, topological and configuration data--received either from
network
elements directly, or via an intervening controller or a manager. Examples of
raw data
include, but are not limited to, sampled flow (sFlow®) and SNMP data
harvested
from network elements. Similarly, topology information can be gleaned from a
controller if available. Other information gleaned from other enterprise
systems (on- or
off-premise) is also applicable; for example, user information received from
an
ACTIVE DIRECTORY® server.
The raw data also includes the results from proactive performance tests
with respect to on- and off-premise applications. In one embodiment, the
collector runs
proactive performance tests (e.g., HTTP GETs, PINGs) with various target
applications.
These target applications can be automatically detected by the present system
and
method or specifically pre-configured by a user.
Crawling Raw Data
Crawling herein refers to an act of dynamically selecting a different set
of raw data for the collectors to examine at any given time. For example,
crawling
includes observing different physical or virtual links, and applying different
filters to
the raw data.
In many cases, the total amount of traffic exceeds the bandwidth of a
collector. This necessitates a device with network packet broker equivalent
(NPBE)
functionality that is capable of driving mirrored and filtered traffic from
multiple parts
of the network to the collector. The present system and method dynamically
programs
one or more NPBE devices with filtering and steering rules to get selected
access to the
data. However, the present system and method also is applicable to a case
where the
traffic mirrored to the collector comes from a small number of locations
(e.g., mirrored
traffic from WAN links), and when the total simultaneous mirrored traffic is
less than
the bandwidth of the collector. This case may not require a NPBE device. In
one
embodiment, the NPBE is one or more software elements, for example, running as
part
of the collector.
"Crawling" the raw data is a significant problem, especially in situations
where the disclosed system and method dynamically controls one or more NPBEs
within the network to capture packets from different parts of the network at
different
times. In one embodiment, NPBE functionality is implemented by a controller
operating on top of a switch. In this case, the manager, either directly or
proxied via the
collector, can command the controller to have the underlying network elements
implement the NPBE functionality.
The method for controlling the network packet broker equivalent is for
the manager to compute a dynamic crawling and filtering schedule that informs
the
NPBE on how it may steer traffic to the collector. The computation of the
dynamic
crawling and filtering schedule may be done in a variety of ways; for example,
but not
limited to, as a function of topology, computation and network resources at
the
collector, and statistics.
An example of a dynamic crawling and filtering schedule is: Send all
ingress and egress traffic from link e1 to the collector; From link e2, send
ingress and
egress traffic with source or destination port equal to 80 to the collector;
and Cycle
through links e3, e4, e5 and e6, 5 minutes at a time, sending all traffic to
the collector.
A dynamic crawling and filtering schedule with more complicated logic
may be sent to the collectors. For example, collectors can be provisioned with
a
program that searches for a dynamic trigger to alter the schedule. For
example, the
dynamic trigger is: "if an application X is detected and is using Y bandwidth,
then
monitor traffic from the link more frequently." In another embodiment, the
dynamic
crawling and filtering schedule is computed to optimize load balancing between
collectors, for example, "send the 1 GBps of traffic from link e1 to collector
#1 and the
1 GBps of traffic from link e2 to collector #2."
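One possible way to compute such a load-balancing assignment is a greedy
heuristic that places each link on the currently least-loaded collector. The
greedy strategy, the function name assign_links, and the collector identifiers
are assumptions of this sketch; the disclosure does not state how the
assignment is computed.

```python
def assign_links(link_rates, collectors):
    """Greedily assign links to collectors, heaviest link first.

    link_rates maps a link name to its traffic rate (same unit for all links).
    Returns (assignment, load) where load is the total rate per collector.
    """
    load = {c: 0.0 for c in collectors}
    assignment = {}
    # Sort by descending rate, then by name so the result is deterministic.
    for link, rate in sorted(link_rates.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(collectors, key=lambda c: load[c])  # least-loaded collector
        assignment[link] = target
        load[target] += rate
    return assignment, load

assignment, load = assign_links({"e1": 1.0, "e2": 1.0},
                                ["collector-1", "collector-2"])
```

With two links of equal rate and two collectors, the sketch reproduces the
split given in the example: link e1 to collector #1 and link e2 to
collector #2.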
According to one embodiment, the collector crawls performance
information of on- and off-premise applications that the present system and
method
detects use of, or that are pre-configured by a user. The performance information
may be
generated by the collector performing performance tests (e.g., PING,
TRACEROUTE,
HTTP GETs) against the applications. The performance information can be
crawled by
periodically running the same HTTP GETs against a target application that is
pre-
configured or automatically detected, and sending to the manager the detected
results.
The crawling schedule may include a command, for example, "if a new
application is
detected, then immediately start running performance tests against it."
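The periodic tests and the "start immediately on detection" command may be
sketched as a small scheduler. The class PerfCrawler, its method names, and
the injected probe callable are hypothetical; a real collector would pass in a
probe that performs an HTTP GET or a PING, whereas this sketch uses a stub so
that it runs without network access.

```python
class PerfCrawler:
    """Track when each target application is next due for a performance test."""

    def __init__(self, probe, period_s=60.0):
        self.probe = probe          # callable(app) -> measured result
        self.period_s = period_s    # assumed repeat period between tests
        self.next_due = {}          # app -> time of the next scheduled test

    def on_app_detected(self, app, now):
        # A newly detected application is due immediately; known ones are kept.
        self.next_due.setdefault(app, now)

    def run_due(self, now):
        """Run every test that is due and return results to send to the manager."""
        results = []
        for app, due in sorted(self.next_due.items()):
            if due <= now:
                results.append((app, now, self.probe(app)))
                self.next_due[app] = now + self.period_s
        return results

# Stub probe standing in for a real HTTP GET timing measurement.
crawler = PerfCrawler(probe=lambda app: 0.05, period_s=60.0)
crawler.on_app_detected("crm", now=0)
first_results = crawler.run_due(now=0)
```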
According to some embodiments, the raw data can be collected from a
controller or a network management system in the following process: Global
view of
L1 -> L7 Network Topology, Port statistics for each network element, if
available,
Current Configuration of each network element under control, Configuration
Capability
of each network element under control, API functionality and configuration
capabilities
of the controller itself, Any higher-layer information available regarding
users,
applications, devices, locations, and the like.
According to some embodiments, the raw data can be collected from an
enterprise system (e.g., ACTIVE DIRECTORY®, lightweight directory access
protocol
(LDAP) servers, single sign-on (SSO) system). Examples of such raw data
include, but
are not limited to, user information such as roles and associated policies,
login status,
and IP address.
According to some embodiments, the raw data can be collected from
network elements directly (e.g., by way of a priori instructions given to a
controller) in
the following process:
Sampled mirrored traffic from various ports in the network, Advanced
statistics such as sFlow®, netFlow®, Previously computed information
regarding users, applications, devices, locations, and Signal strength, error-
rate, and
other performance related information.
According to some embodiments, the raw data can be collected from the
present system or other policy engine such as desired high-level policies.
According to
some embodiments, performance data generated by the collector includes results
of
proactive tests (e.g., PING, HTTP, TCP) performed by the collector on detected
or user
pre-configured on/off-premise applications.
FIG. 7 is a flow diagram of an input collection process at the collector,
according to one embodiment. The input collection process starts (at 601) and
a
collector receives inputs from a manager (at 602). Examples of inputs include,
but are
not limited to: instructions on which enterprise systems to collect data from
and how to
collect the data (e.g., IP address, credentials), sampling schedule for data
collection
from network elements, instructions on initial analysis, filtering,
compression of
collected data, and a list of applications on which to run performance tests.
The collector further sends desired tapping configuration to the
controller and receives network topology (at 603), contacts the enterprise
system and
requests a stream of data to analyze (at 604), receives sampled raw data
streams
identified by time and link (at 605) and extracts features from the received
sampled raw
data streams per instructions (at 606), receives advanced statistics from
network
elements (at 607), and performs application performance tests and collects
data (at 608).
The collector further extracts features using information collected from 603-
608 and
compresses collected information (at 609). The collector sends data to the
manager (at
610), and repeats the input collection process.
Feature Extraction
According to one embodiment, the present system and method extracts
key features and/or metadata from the crawled data. For example, packets are
streaming into the collector at multiple gigabits per second speeds. The
collector
extracts a set of features on a flow-by-flow, or a host-by-host basis from
millions of
packets per second and tens of thousands of flows per second, and sends the
extracted
data to the manager in less than a few hundred bytes per second per flow. In
one
embodiment, a flow is defined by the 5-tuple of (srcIP, dstIP, srcPort,
dstPort,
protocol). The definition of a flow may be expanded to apply to other
primitives such
as application or other combinations of packet header fields (e.g., Layer 2
flows include
source and destination media access control (MAC) addresses in the definition
of a
flow).
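The 5-tuple flow definition and the reduction of many packets to a compact
per-flow record may be illustrated as follows. The FlowKey type, the packet
dictionary layout, and the two counters are assumptions of this sketch rather
than the feature set of the disclosure.

```python
from collections import defaultdict
from typing import NamedTuple

class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str

def summarize_flows(packets):
    """Reduce a packet stream to one small record per 5-tuple flow."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = FlowKey(pkt["src_ip"], pkt["dst_ip"],
                      pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["length"]
    return dict(flows)

# Hypothetical parsed packet headers.
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "src_port": 50001,
     "dst_port": 80, "protocol": "TCP", "length": 1500},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "src_port": 50001,
     "dst_port": 80, "protocol": "TCP", "length": 400},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "src_port": 50002,
     "dst_port": 53, "protocol": "UDP", "length": 80},
]
flows = summarize_flows(packets)
```

Expanding the definition of a flow, for example to Layer 2, would amount to
adding the source and destination MAC addresses to the key.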
Examples of a flow-by-flow feature include, but are not limited to:
Number of different HTTP2xx RESPONSE packets; Number of different HTTP3xx
RESPONSE packets; Number of different HTTP5xx RESPONSE packets; Binary
feature of whether IP Traffic is present; Number of different types of HTTP
packets;
Number of different types of DNS packets; Number of different types of DHCP
packets; Binary feature of whether TCP SYN was followed by TCP SYN ACK;
Binary feature of whether DNS_Q was followed by DNS SUCC RESP; Binary feature
of whether DHCP REQUEST was followed by DHCP GRANT; Set of
source/destination MAC addresses present in the flow; Each of the above
features on a
time slice by time slice basis (e.g., every 10 seconds of a flow); Mean,
median and
variance of packet inter-arrival times, payload sizes; Flag indicating whether
window
scaling was requested; Number of TCP FIN packets seen.
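Two of the listed flow-by-flow features may be computed as sketched below: the
mean, median and variance of packet inter-arrival times, and the binary
feature of whether a TCP SYN was followed by a TCP SYN ACK. The function names
and the event labels are hypothetical, and the use of population variance is
an assumption, since the text does not say which variance is meant.

```python
from statistics import mean, median, pvariance

def interarrival_features(timestamps):
    """Mean, median and (population) variance of gaps between packet arrivals."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if not gaps:
        return None  # fewer than two packets: no inter-arrival time exists
    return {"mean": mean(gaps), "median": median(gaps), "variance": pvariance(gaps)}

def syn_followed_by_synack(events):
    """Binary feature: True if a SYN-ACK is observed after a SYN in the flow."""
    seen_syn = False
    for event in events:
        if event == "SYN":
            seen_syn = True
        elif event == "SYN-ACK" and seen_syn:
            return True
    return False

features = interarrival_features([0, 1, 3, 6])          # gaps are 1, 2, 3
handshake_ok = syn_followed_by_synack(["SYN", "SYN-ACK", "ACK"])
```

Computing the same features per time slice would only require calling these
functions on the packets of each slice.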
Examples of a host-by-host feature include, but are not limited to:
Number of different hosts a particular host interacts with; Set of hosts that
interact with
each other; Number of ports used for transactions (indicates server vs.
client).
Examples of application-level metadata include, but are not limited to:
HTTP response and page load times; Voice and video call MOS scores; Response
times
of other protocols (DNS, DHCP, RADIUS, and the like).
Small raw data (e.g., statistics, topology) can be compressed and sent to
the manager. However, intelligent feature extraction is required to send large data to
the manager. An example of large data is statistical data (e.g., average
link
utilization). Similarly, the performance test results might be reduced down to
specific
features (e.g., average HTTP response time, presence of an anomaly in the
performance
test).
EXAMPLES
FIG. 8 illustrates a diagram of an exemplary network, according to one
embodiment. Seven switches s0-s6 and network elements h0-h2 are arranged
hierarchically. The top switch s0 is connected to the Internet 750, and a
manager 701 is
deployed in a server in the public cloud and connected via the Internet 750. A
collector
702 is deployed as a virtual machine (VM) on a server attached to switch s6.
The
switches s0-s6 are switches and a controller 715 is deployed as a server
attached to
switch s5. An active directory server 725 is also connected to switch s5.
FIG. 9 illustrates a diagram of an exemplary legacy network including
a switch, according to one embodiment. Seven switches s0-s6 (and network
elements h0-h3) feed mirrored traffic (as indicated by dotted lines) into a
switch 824.
The mirroring configuration is static, and as an example may simply mirror the
traffic
from each switch's uplink. The collector 802 and controller 815 are deployed
connected to ports connected to the switch 824. The manager 801 is deployed in
a
server in the public cloud and connected to the switch s0 over the Internet
850. An
active directory server 825 is also connected to switch s5. It is noted that
mirror ports
can be manually configured without the presence of a switch.
The collector 802 dynamically captures packets from multiple links in
the network. As an example, the link to the collector is a 2 GBps link (e.g.,
2 link-aggregated 1 GBps links), and other links (including the WAN link)
are 1 GBps
links. In
this case, the manager may send a crawl schedule to the collector, for
example: Collect
the features on the WAN link (e0) 100 percent of the time, and Continuously
cycle
through links e3, e4, e5, e6 (i.e., certain of the depicted links) for five
minute stretches,
and collect all the features during that time.
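The reason this crawl schedule fits the collector can be checked with simple
arithmetic: the always-on WAN link plus whichever single link is currently
being cycled must not exceed the capacity of the link to the collector. The
helper below is a hypothetical sketch of that check, with rates expressed in
the same unit as the example.

```python
def peak_mirrored_rate(always_on, cycled, rate):
    """Worst-case simultaneous rate: every always-on link plus one cycled link."""
    peak = sum(rate[link] for link in always_on)
    if cycled:
        # Only one cycled link is tapped at a time, so take the largest.
        peak += max(rate[link] for link in cycled)
    return peak

# Rates from the example: every monitored link carries at most 1 unit,
# and the link to the collector carries 2 units.
link_rate = {link: 1.0 for link in ("e0", "e3", "e4", "e5", "e6")}
collector_capacity = 2.0

peak = peak_mirrored_rate(["e0"], ["e3", "e4", "e5", "e6"], link_rate)
fits = peak <= collector_capacity
```

Under these assumed figures the peak equals the collector capacity, so the
schedule fits with no headroom; tapping two of the cycled links at once would
not.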
Summarization and Indexing
Summarization and indexing functionalities are implemented in a
manager, although it is possible to embed some or all of this functionality in
a collector
as well. The summarization and indexing processes take input features and
other
relevant data from the collector(s) and other systems. The first outputs of
the
summarization and indexing processes are higher-layer inferences, or bindings.
Specifically, the relationship or binding of higher-layer data (e.g., users,
applications,
devices) to lower layer data (e.g., IP and MAC addresses, ports) is computed
and
indexed in a database. The present system and method provides a capability to
query
using natural language and high-layer control primitives, and any high-level
indexed
information, both current and historical.
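A binding of higher-layer data to lower layer data that can be queried both
currently and historically may be sketched as a time-indexed lookup. The
BindingIndex class and its methods are hypothetical; the example binds users
to IP addresses, but the same structure could bind applications or devices.

```python
import bisect

class BindingIndex:
    """Index of user-to-IP bindings that remembers when each binding started."""

    def __init__(self):
        self._by_ip = {}  # ip -> (sorted start times, users in the same order)

    def bind(self, ip, user, start_ts):
        starts, users = self._by_ip.setdefault(ip, ([], []))
        i = bisect.bisect_right(starts, start_ts)
        starts.insert(i, start_ts)
        users.insert(i, user)

    def user_at(self, ip, ts):
        """Return the user bound to `ip` at time `ts`, or None if unknown."""
        starts, users = self._by_ip.get(ip, ([], []))
        i = bisect.bisect_right(starts, ts) - 1
        return users[i] if i >= 0 else None

index = BindingIndex()
index.bind("10.0.0.5", "alice", start_ts=100)
index.bind("10.0.0.5", "bob", start_ts=200)   # the address was later reassigned
```

A historical query such as index.user_at("10.0.0.5", 150) returns "alice",
while a query at time 200 or later returns "bob".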
The lower layer data may vary depending on an objective such as
network visibility or network control. For network visibility, the lower layer
data
includes, but is not limited to, protocol level metrics and metadata. For
network
control, the lower layer data includes, but is not limited to, control
primitives such as
ports, MAC addresses, IP addresses, an access control list (ACL), quality of
service
(QoS), and rate limit setting. According to one embodiment, the present system
and
method predicts performance of one or more of an application, a user, and a
device
based on observed characteristics of the network around network protocol level
metrics
and metadata.
The main role of the summarization process is to store and learn from
the inputs received from the collector(s) and other enterprise systems. FIG.
10 is a flow
diagram of an exemplary information collection process, according to one
embodiment.
The collection process starts (at 901) as a manager obtains API functionality
and
configuration capabilities from a controller (at 902). The manager computes a
sampling
schedule as a function of a desired performance objective and topology and
sends the
sampling schedule to the collector (at 903). The manager also computes and
sends
instructions for the collector to interact with the controller and other
enterprise systems,
collect advanced statistics from network elements, and determine how to
analyze, filter,
and compress the raw data (at 904). The manager also receives raw compressed,
filtered features, and other data from the collector (at 905), and indexes and
stores the
received raw features and data in a database using time, link and
other
aspects such as source IP address, as keys (at 906). The manager also collects
high-level policies from a user via a user interface and other policy engines, and
user feedback
to aid and improve a learning algorithm (at 907).
From the set of input features and relevant input data, the present system
and method uses two background processes to summarize (i.e., extract higher-
layer
information) and index the summarized data. The first, an incremental process, acts upon
the
reception of any new raw (i.e., un-summarized) feature data or any data update
that
causes previously indexed information to be immediately erroneous (e.g., a
user
changed IP address). This process runs a heuristic classification algorithm to
summarize the raw features. The second process is a global process that runs
periodically to update a learning model (e.g., re-training the classification
algorithm), as
well as re-summarize past data. Examples of the higher-layer information
include, but
are not limited to: Users; Applications; Protocols; Device; Content; Network
and
Physical Location (Telemetry); and Derived metadata, including: Learned
relationships
between the above (e.g., User X tends to access applications of type Y, tends to
generate
Z amount of traffic), Learned attributes of the above (e.g., rate of change
vs.
"stickiness" of the relationships), Learned behaviors about the above (e.g.,
this
application appears to be having TCP issues, this user appears to be doing
something
malicious), and Learned changes in behavior of the above (e.g., this
application has had
an abnormally high set of errors, this application is using abnormally high
bandwidth).
The summarization and indexing de-duplicates data. For example, if
multiple collectors send the same data, the manager recognizes the
duplication of
data and disambiguates. In another example, if multiple collectors see the
same
information from the same enterprise system, the manager recognizes the
duplicate
information and disambiguates.
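De-duplication at the manager may be sketched as keeping one copy of each
record regardless of which collector reported it. Treating two records as
duplicates when all of their fields match is an assumption of this sketch, as
are the function name and the record layout.

```python
def deduplicate(reports):
    """Keep the first copy of each record reported by any collector.

    reports is an iterable of (collector_id, record) pairs, where record is a
    dict with hashable values.
    """
    seen = set()
    unique = []
    for collector_id, record in reports:
        key = tuple(sorted(record.items()))  # content-based identity
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Two collectors report the same directory entry; a third record differs.
reports = [
    ("collector-1", {"user": "alice", "ip": "10.0.0.5", "role": "sales"}),
    ("collector-2", {"user": "alice", "ip": "10.0.0.5", "role": "sales"}),
    ("collector-2", {"user": "bob", "ip": "10.0.0.6", "role": "eng"}),
]
unique_records = deduplicate(reports)
```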
FIG. 11 is a flow diagram of summarization and indexing processes,
according to one embodiment. The summarization and indexing process starts (at
1001) and the manager determines whether a new feature is received or there is
a
change in network topology, statistics, and user information (at 1002). The
manager
runs an incremental algorithm to summarize and index any raw feature data,
and runs a
re-indexer to update previously summarized and indexed data with changes of
user or
topology information (at 1003). A combination of processes is used to compute
a
higher-layer binding. The manager periodically (e.g., once per day) runs a
global re-
summarizer and re-indexer (at 1004). For example, the collector performs a
deep
packet inspection (DPI) to identify unencrypted application traffic, and the
identified
application is sent as a feature. Alternatively, the machine learning at the
manager
based on characterizing applications by the flow or host features described
earlier can
be used for encrypted traffic. User information and device information can be
gleaned
by accessing other enterprise systems such as active directory, extracting key
information from packets (e.g., user agent string, organizationally unique
identifier
(OUI)), or examining network-topology (e.g., wireless traffic comes from where
the
wireless access points are located).
Another example concerns detecting application behaviors. For
example, the machine learning at the manager can identify that the presence of
certain
packets (e.g., HTTP error packets) indicates certain types of errors.
Similarly, a
heuristic algorithm that takes into account the exact physical path the
traffic takes can
reveal other application behaviors. For example, packets are seen with
increasing inter-
arrival times as they pass through a particular switch; this indicates a
congested or
misconfigured switch. An example of the outputs of the heuristic algorithm is
a
probabilistically ranked list of higher-layer bindings.
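One simple heuristic of this kind compares packet inter-arrival times on the
ingress and egress side of each switch along the path and ranks the switches
by how much the gaps grow. The function name, the input layout, and the 1.5
threshold are assumptions of this sketch, and the ratio it reports is only a
ranking score, not a probability.

```python
from statistics import mean

def suspect_switches(observations, ratio_threshold=1.5):
    """Rank switches by growth of mean inter-arrival time from ingress to egress.

    observations is a list of (switch, ingress_gaps, egress_gaps) tuples for
    the switches on the physical path of the traffic of interest.
    """
    suspects = []
    for switch, ingress_gaps, egress_gaps in observations:
        before, after = mean(ingress_gaps), mean(egress_gaps)
        if before > 0 and after / before >= ratio_threshold:
            suspects.append((switch, after / before))
    # Largest growth first: the most likely congested or misconfigured switch.
    return sorted(suspects, key=lambda item: -item[1])

# Hypothetical measurements (milliseconds) along a two-switch path.
path_observations = [
    ("s1", [1, 1, 1], [1, 1, 1]),   # gaps unchanged through s1
    ("s2", [1, 1, 1], [2, 3, 4]),   # gaps grow through s2
]
ranked = suspect_switches(path_observations)
```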
According to one embodiment, training data is collected via a user's
labelling of data. For example, a user, via a cloud portal, specifies that a
particular user
or application issue occurred recently. In another example, the present system
and
method suggests a set of possibilities for a given query. The user specifying
which, if
any, of those possibilities is the correct one is useful training data.
Further
generalizing this, the present system and method combines algorithm insights
from
multiple networks to further enhance the classification of the collected data.

According to another embodiment, the present system and method
performs, in real time, a segment-by-segment analysis of a particular
user/application/device's traffic. To do this, the present system computes the
physical
and logical links that the traffic of interest is taking, and alters the
tapping schedule of
the collector(s) so that they collect data (e.g., packets, stats) pertaining
to the physical
links. Finally, the resultant features are indexed and analyzed in a similar
vein to
normally collected features.
Another example of summarization and indexing is computing
composite metrics from the raw features and computing and storing comparisons
of
these metrics across different dimensions. For example, the present system and
method
computes a device quality-of-experience metric from raw measures of response
times,
packet loss, etc., and compares the value of that metric against devices of
the same or
different type (e.g., iPhones), those with the same or different operating
system (e.g.,
Android), those connected to the same access point, etc. The computed, stored
and
indexed information can be quickly retrieved via a user interface query. It
can also be
used for a closed loop control with a programmable controller. The
programmable
controller controls network elements. The network manager controls the network
elements via the programmable controller.
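Computing a composite metric and comparing it across a dimension may be
sketched as follows. The scoring formula and its weights are purely
hypothetical placeholders, since the disclosure does not specify how the
device quality-of-experience metric is derived from the raw measures.

```python
from collections import defaultdict
from statistics import mean

def qoe_score(response_time_s, packet_loss):
    """Hypothetical 0-100 score: penalize slow responses and packet loss."""
    score = 100.0 - 10.0 * response_time_s - 200.0 * packet_loss
    return max(0.0, min(100.0, score))

def compare_by(devices, dimension):
    """Average QoE score per value of a dimension (e.g., "os" or "ap")."""
    groups = defaultdict(list)
    for device in devices:
        groups[device[dimension]].append(
            qoe_score(device["response_time_s"], device["packet_loss"]))
    return {value: mean(scores) for value, scores in groups.items()}

# Hypothetical raw measures for three devices.
devices = [
    {"os": "Android", "ap": "ap-1", "response_time_s": 2.0, "packet_loss": 0.05},
    {"os": "Android", "ap": "ap-2", "response_time_s": 1.0, "packet_loss": 0.0},
    {"os": "iOS", "ap": "ap-1", "response_time_s": 0.5, "packet_loss": 0.0},
]
by_os = compare_by(devices, "os")
by_ap = compare_by(devices, "ap")
```

Storing the per-group averages alongside the per-device scores is what allows
a later query to answer quickly whether one device is doing worse than its
peers.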
Cross Network Learning
The manager located in the cloud has access to systems from multiple
enterprises. For example, the present system is deployed as a multi-tenant
system
across customers. In such a deployment, no data is shared across customers,
but the
processes may be shared across customers.
An example of cross network learning is to train separate classifiers for
computing higher-layer bindings from the extracted features of separate
customer
networks. The separate classifiers can be combined to come up with an overall
better
classification (e.g., majority wins). Another example of cross network
learning is
learning the most common queries across networks and dedicating a higher
compute
power to have a better answer for those particular queries.
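The "majority wins" combination of separately trained classifiers may be
sketched as a simple vote. The per-network classifiers below are stand-in
functions with made-up rules; only the models are shared in this sketch, not
the data of any customer network.

```python
from collections import Counter

def majority_vote(classifiers, features):
    """Combine per-network classifiers; return (label, fraction of votes)."""
    votes = Counter(classify(features) for classify in classifiers)
    label, count = votes.most_common(1)[0]
    return label, count / len(classifiers)

# Stand-ins for classifiers trained separately on three customer networks.
def network_a(f):
    return "video" if f["mean_payload"] > 900 else "web"

def network_b(f):
    return "video" if f["packets_per_s"] > 50 else "web"

def network_c(f):
    return "web"  # this network never saw video traffic during training

flow_features = {"mean_payload": 1200, "packets_per_s": 80}
label, confidence = majority_vote([network_a, network_b, network_c],
                                  flow_features)
```

The vote fraction gives a rough confidence that could accompany the
resulting higher-layer binding.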
Another example of cross-network learning is based on different system
deployments that interact with each other. For example, the present system is
deployed
at customer network 1 and customer network 2 that send a lot of traffic to
each other.
The present system and method automatically detects the heavy traffic, and
runs a more
advanced performance testing algorithm directly between the collectors on both
customer networks.
Another example of cross-network learning is for predicting higher-layer
performance based on observed lower-layer characteristics of the network and
applications. For example, suppose that on one network, the present system
learned
that high AP channel utilization results in jitter, which in turn causes poor
real-time video
application performance. The present system detects the presence of high AP
channel
utilizations to predict poor performance for another network that may or may
not have
yet deployed a real-time video application.
Query-Ability
According to one embodiment, the present system and method provides
natural language query-ability of the network. The manager has a query box
that takes
natural language type input regarding the network and its
users/applications/devices/behaviors. Examples of natural language queries
are: "User
X is having problem Y with application Z," "User X is experiencing slowness
with
salesforce.com," and "Tell me about the SAP application."
The present system and method responds to the queries and presents a
probabilistically ranked list of answers, along with the
probabilities/confidence for each
answer. The present system and method also presents deeper supporting
evidence if
requested by the user.
Summary and Example
The manager receives feature data from one or more collectors at various
levels, for example, a flow level, host level, user level, and link level. The
manager
collects and indexes the collected data in terms of flow, host, user, link,
and time
intervals. As a flow of feature data arrives, the manager runs an incremental
process to
classify (a) an application that the flow corresponds to, (b) any interesting
behaviors
that the application underwent (e.g., failure to connect to a server, slow,
errors), (c) a
user involved in using the application, and (d) the devices involved in using
the
application. Additionally, the manager ties topology knowledge to an
application (e.g.,
the location of an application server, network links that the application
traffic traverses).
This information is indexed along with each feature. The collector
automatically runs
performance tests on detected or configured application servers, for example,
running
ping tests to the application servers. The performance test results are also
indexed
along with the applications and features.
According to one embodiment, the present system and method provides
a query interface (e.g., Internet interface) to a user. The user enters a
query, for
example, in a natural language form, into the user interface of the present
system. For
example, a user's query is "tell me about application X." The present system
proceeds
to perform the following steps:
i. Query the indexed database for (a) the location of
the
application (e.g., on-premise, in a cloud), (b) users who were
using the application over the last few hours, (c) the
behaviors of the application, (d) the bandwidth that the
application was using.

ii. Display the results of (i).
iii. Compute the links that have carried the application traffic
over the last day. Send a command to the collector to
immediately collect a ten-second sample of all traffic on all
of the links. Send commands to the programmable network
element (e.g., via a controller) to forward the traffic from
the links to the collector.
iv. Augment the previously displayed results with those found
in (iii).
Another sample query may state, "user X is having problem Y with
application Z" (i.e., tell me about it). The manager proceeds to perform the
following
steps:
i. Query the indexed database for flow instances where user X
was using application Z. Of the behaviors recorded, rank-order
the potential problem behaviors. Compare the
corresponding features across links along network paths.
Compare the features across time (i.e., historically).
ii. Display (i).
iii. Compute the links that have carried this user's application
traffic over the last day. Send a command to the collector to
immediately collect a ten-second sample of all traffic on all
of these links. Send commands to the programmable network
element (e.g., via a controller) to forward the traffic from
those links to the collector.
iv. Augment the previously displayed results with those found
in (iii).
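Step (iii) of the query handling above can be sketched as follows. The flow record layout (dictionaries with "user", "application", "timestamp", and "links" keys) and the function name are hypothetical; only the one-day look-back window comes from the description.

```python
def links_carrying_traffic(flows, user, application, now, window_seconds=24 * 3600):
    """Return the sorted names of links that carried the given user's
    application traffic within the look-back window ending at `now`."""
    links = set()
    for flow in flows:
        in_window = now - window_seconds <= flow["timestamp"] <= now
        if flow["user"] == user and flow["application"] == application and in_window:
            links.update(flow["links"])
    return sorted(links)
```

The returned link list is what a collector would be asked to sample in step (iii).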
Control
According to some embodiments, the present system and method
involves using the visibility of the network and controlling the network. An
example of
controlling the network is enforcing a higher-layer policy throughout the
network.
Another example is automatic problem and security/anomaly/performance
remediation
where applicable. The present system and method may implement a network
control in
(a) a manual, or prescribed control, and (b) an automatic closed loop control.
In both
cases, one of the distinctions from the visibility perspective is that the
binding of a
higher-layer policy or a control objective needs to be tracked to the specific
low-layer
control primitives that the underlying network elements can be programmed
with.
Examples of the high-level control objectives include, but are not limited to:
Block user
X from accessing the network, Maintain high performance for Application Y,
Detect
and mitigate denial of service (DOS) attacks, and Prioritize user class Z
traffic.
For a manual/prescribed control, the control instructions that achieve a
high-level objective are computed and presented to the user, but not
automatically
programmed into the network elements. In addition, specific network elements
that
require a new or updated configuration based on the control instructions are
computed
as a function of network topology and presented to the user. The present
system
computes how the control is to be achieved in a distributed manner. The
control
instruction sets may be probabilistically ranked in the order of predicted
effectiveness.
While an explicit machine-to-machine programmability (e.g., controller) may
not be
required in some embodiments, it may be required for the present system to
discover
the configuration state and capabilities of the various network elements in
other
embodiments. The present system takes into account specific low-level control
primitives that the network elements can be configured with. For example, many
network elements have IP, MAC, and TCAM hardware tables of different sizes
that are
programmable with different primitives.
According to some embodiments, the disclosed system and method
dynamically tracks the bindings between a user and the network (IP address,
MAC
address, physical port) as a user changes devices, plugs into a different sub-
network,
and receives a new IP address from a dynamic host configuration protocol
(DHCP)
server. According to some embodiments, the present system and method binds an
application/network performance issue to specific traffic forwarding decisions
(e.g.,
application slowness is caused by a set of particular source/destination IP
address pairs
that are highly utilizing a particular link) or a network configuration (e.g.,
a
misconfigured maximum transmission unit (MTU)). According to some embodiments,
the present system and method ties a particular anomalous traffic behavior to
a specific
user/application/device, and further to particular IP/MAC addresses.
According to some embodiments, the present system and method takes
into account the topology and capabilities of the underlying network hardware.
For
example, if one is trying to use a pure layer 2 switch to enforce a user
policy, it would
be required to dynamically track the User -> MAC address binding, and use only
MAC addresses for programming rules into the switch. As an example of taking
the topology into account, the present system and method tries to enforce a
policy as close
to the
edge of the network as possible, which current firewalls, usually deployed
inline at
logical or physical network choke points, cannot do. The rules programmed to
the
network elements can be changed in a closed loop manner when the higher-layer
to
lower-layer bindings change.
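The binding tracking described above can be sketched as a small data structure. The class names, the rule dictionary format, and the "eth_src" match key are hypothetical and do not correspond to any particular switch API. The sketch assumes that a MAC address identifies a device, so a new observation for a known MAC replaces that device's earlier binding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    ip: str
    mac: str
    port: str


class BindingTracker:
    """Tracks user -> (IP address, MAC address, physical port) bindings."""

    def __init__(self):
        self._bindings = {}  # user -> set of Binding

    def observe(self, user, ip, mac, port):
        """Record an observation; return True if the user's bindings changed,
        which is the signal to reprogram rules in a closed loop."""
        binding = Binding(ip, mac, port)
        current = self._bindings.setdefault(user, set())
        # A device that moved ports or received a new DHCP lease keeps its MAC,
        # so any older binding for the same MAC is stale.
        stale = {b for b in current if b.mac == mac and b != binding}
        changed = bool(stale) or binding not in current
        current -= stale
        current.add(binding)
        return changed

    def layer2_block_rules(self, user):
        """Rules for a pure layer 2 switch: match on MAC addresses only."""
        macs = sorted({b.mac for b in self._bindings.get(user, set())})
        return [{"match": {"eth_src": mac}, "action": "drop"} for mac in macs]
```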
FIG. 12 is a flow diagram of a control loop, according to one
embodiment. The control loop starts (at 1101), and the manager determines
whether
there are unsatisfied high-level control objectives (at 1102). The manager
branches off
based on a control method (at 1103). For a manual control method, the manager
computes the optimized lower-level rules and topologies to send to the network
controller based on, but not limited to, 1) the high-level control objective,
2) estimated
higher-layer bindings values and associated uncertainties, 3) configuration
capability
and current configuration of underlying network elements, and 4) other
information
such as network topology, statistics, tolerable configuration changes (at
1104). The
manager presents the control method of specific network elements to achieve
the high-
level control objective to the user (at 1105). For an automatic control, the
manager
computes the initial update control to the programmable network elements based
on,
but not limited to, 1) the high-level policies, problems, security
requirements,
anomalies, 2) estimated higher-layer parameter values and associated
uncertainties, 3)
configuration capability and current configuration of underlying network
elements, 4)
other information such as network topology, statistics, tolerable
configuration change,
5) measurement of effectiveness of the control policy, and 6) control loop
parameters
such as stability, oscillation, timescale (at 1106). The manager sends the
control policy
parameters to the programmable network elements (at 1107), and observes the
network
and measures effectiveness of the control policy with respect to the high-
level policy (at
1108).
As an example of manual/prescribed control, the present system and
method enforces a high-level objective of blocking user X from the network. To
do
this, the present system and method first derives the IP addresses that user X
corresponds to. Then, the present system and method computes a logical choke
point to
point to
apply the policy effectively. For example, the logical choke point corresponds
to the
routers on the subnets of user X's IP address. The output of the present
system includes
a set of commands at each of the routers that results in the traffic from/to
those IP
addresses being dropped. An alternative output is a set of commands to a
controller to
implement a desired control.
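The prescribed-control example above can be sketched with the Python standard library ipaddress module. The function name, the router-to-subnet mapping, and the command strings are hypothetical placeholders; real output would be in the configuration syntax of the routers or controller in use.

```python
import ipaddress


def drop_commands_for_user(user_ips, router_subnets):
    """Compute per-router drop commands for a blocked user.

    user_ips: IP address strings currently bound to the user.
    router_subnets: dict mapping a router name to the CIDR subnets it serves.
    Only routers on the subnets of the user's addresses (the logical choke
    points) receive commands.
    """
    commands = {}
    for router, subnets in router_subnets.items():
        networks = [ipaddress.ip_network(s) for s in subnets]
        matched = [
            ip for ip in user_ips
            if any(ipaddress.ip_address(ip) in net for net in networks)
        ]
        if matched:
            # Placeholder command text, presented to the user rather than applied.
            commands[router] = [f"drop traffic from/to {ip}" for ip in matched]
    return commands
```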
For an automatic control, the present system and method programs the
network elements in a closed loop manner to achieve and maintain a high-level
control
objective. The automatic control is based on an inherent assumption that the
underlying
network has programmable network elements. In addition to the binding of
higher-
layer objectives to low-layer programmable primitives and taking into account
the
configuration state and capabilities of the underlying network elements, the
present
system and method computes a dynamic control loop. The present system and
method
first applies a possible control (e.g., a gain) and checks to see if a high-
level objective is
achieved. If so, the present system and method backs off the remediation
and/or applies
a different but lighter remediation and checks again to see if the high-level
objective is
still achieved. If not, the present system and method attempts to apply a
heavier control
and/or re-diagnose the higher-layer objective to low-layer control primitives
binding
and apply a different control. This procedure is also depicted in FIG. 12. The
first step
of the closed loop control may be different from the steps provided by the
manual
control. Additionally, factors such as stability, oscillation and timescale of
response
may be taken into account in the setup of the control loop.
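The back-off and escalation behavior described above can be sketched as a search for the lightest control that still achieves the objective. This is a simplified illustration under stated assumptions: controls are given as a list ordered from lightest to heaviest, objective_met is a hypothetical callback that applies a control and reports whether the high-level objective is achieved, and max_rounds is an assumed bound that keeps the loop from oscillating indefinitely. Re-diagnosis of bindings is not modeled.

```python
def settle_control(levels, start, objective_met, max_rounds=10):
    """Return the lightest control found that achieves the objective, or None.

    levels: controls ordered from lightest to heaviest.
    start: index of the first control to try.
    objective_met: callable(control) -> bool, evaluated after applying control.
    """
    current = start
    lightest_ok = None
    for _ in range(max_rounds):
        if objective_met(levels[current]):
            lightest_ok = current
            if current == 0:
                break  # nothing lighter to try
            current -= 1  # back off to a lighter remediation
        else:
            if lightest_ok is not None:
                break  # the lighter control failed; keep the last one that worked
            if current == len(levels) - 1:
                break  # nothing heavier to try
            current += 1  # escalate to a heavier control
    return None if lightest_ok is None else levels[lightest_ok]
```

For the blocking example, levels could be ordered as a rule at the default gateway only, rules at all routers, and rules at all routers plus port blocks.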
The automatic closed loop control can be applied to the example of
blocking user X from the network. In this example, the present system and
method
programs rules to drop traffic from/to user X's IP address(es) at the routers
in the
network. Assuming that works, the present system and method tries to program
only
user X's default gateway router with a rule. If it fails, the present system
and method
applies more rules to other routers and/or blocks certain ports and continues.
When the
user X comes in on a new IP address(es), the present system and method
automatically
adjusts to the changed network topology.
Another use case of an automatic closed loop control is where the
control objective is to maintain high performance for application X. In this
case, the
present system and method simply programs rules that place all traffic
corresponding to
that application into the highest performing queue. If improved application X
performance is not observed, the present system and method attempts to
program rules
that re-route or rate-limit traffic from applications that share common
network links
with application X. If improvements are observed, the present system and
method
restores the performance of other applications.
An example of a higher-layer policy (for manual or automatic control) is
"Prioritize traffic from employees using business applications such as
Salesforce.com
or Workday, over casual traffic such as traffic from guest users using a
different set of
applications." To implement this higher-layer policy, the present system and
method
dynamically tracks the session 5-tuples for these combinations, computes a
minimal set of rules necessary for the enforcement, and dynamically tracks and
programs those rules.
According to some embodiments, the present system and method
automatically provides remedies to network problems. For example, a user
enters in a
query of the form "user X is having problem Y with application Z," and the
present
system and method provides the top-ranked answer (i.e., the answer with
confidence
greater than a certain threshold) that "there is congestion on common network
links
caused by users using application W." If automatic remediation is enabled for
this
particular query, the manager sends instructions to the collector to command
the
controller to tell the appropriate network elements to (a) prioritize user X
-> application Z traffic over other traffic, or (b) disallow traffic involving
application W.
The (b)
remediation approach may require additional policy permission from the
operator due
to the restrictive nature of the traffic disallowing policy.

Referring to FIG. 5 as an example of the remediation process, suppose
that user X is "attached" to switch s3 and that application Z server is
"attached" to
switch s4. The policy to prioritize user X -> application Z traffic may be
applied by the controller, which sends rules to switch s3 that match user X's
IP address (as source IP) and the application server IP address (as
destination IP), and have an action that marks the IP diffserv code point
(DSCP) bits to represent the highest class of
service.
Similarly, the reverse rule is applied to switch s4 (i.e., with the source and
destination
IP addresses flipped).
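The forward and reverse rules described above can be sketched as plain match/action records. The dictionary layout and key names are hypothetical and are not the format of any specific controller. The DSCP value 46 (Expedited Forwarding) is used only as an illustrative default; the description does not state which code point represents the highest class of service.

```python
def prioritization_rules(user_ip, server_ip, user_switch="s3",
                         server_switch="s4", dscp=46):
    """Build the pair of DSCP-marking rules for user <-> application traffic."""
    # Rule at the switch the user is attached to: user -> application server.
    forward = {
        "switch": user_switch,
        "match": {"ip_src": user_ip, "ip_dst": server_ip},
        "action": {"set_dscp": dscp},
    }
    # Reverse rule at the server's switch: source and destination flipped.
    reverse = {
        "switch": server_switch,
        "match": {"ip_src": server_ip, "ip_dst": user_ip},
        "action": {"set_dscp": dscp},
    }
    return [forward, reverse]
```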
Alternatively, the rules may be applied to all switches along the
communication path. These rules have similar match fields, but the action
field directly
sends the traffic to the highest priority queue. If the policy is to drop user
X -> application Z traffic, the rules are applied to the edge switches s3 and s4,
respectively.
This is a useful technique since the rules do not need to be applied
everywhere in the
network.
Another example of the automated remediation process is in the
configuration domain. For example, for a query "there is a problem with
application
X," suppose that the top-ranked answer is "the problem appears to be that
switch Y is
dropping packets due to a misconfigured maximum transmission unit (MTU)
value."
The present system and method remediates this situation automatically by
sending
instructions to the collector to command the controller to reconfigure the MTU
value of
the appropriate switch.
According to some embodiments, one of the applications of turning
visibility into control is a full-fledged distributed firewall. For example,
the operator
sets up a policy "user X cannot access application Y," or "user X may be
barred from
the network for Y minutes after Z failed logon attempts." In another example,
the
operator sets up a policy to isolate (e.g., on a quarantine VLAN) a user
whose
traffic exhibits malicious or anomalous behavior. The detection and manual or
automatic remediation of an anomaly (e.g., a detected DOS attack) can also be
addressed within the control framework of the present system and method.
FIG. 1 illustrates an exemplary computer architecture that may be used
for the present system, according to one embodiment. The exemplary computer
architecture may be used for implementing one or more components described in
the
present disclosure including, but not limited to, the present system. One
embodiment of
architecture 1200 includes a system bus 1201 for communicating information,
and a
processor 1202 coupled to bus 1201 for processing information. Architecture
1200
further includes a random access memory (RAM) or other dynamic storage device
1203
(referred to herein as main memory), coupled to bus 1201 for storing
information and
instructions to be executed by processor 1202. Main memory 1203 also may be
used
for storing temporary variables or other intermediate information during
execution of
instructions by processor 1202. Architecture 1200 may also include a read-only
memory (ROM) and/or other static storage device 1204 coupled to bus 1201 for
storing
static information and instructions used by processor 1202.
A data storage device 1205 such as a magnetic disk or optical disc and
its corresponding drive may also be coupled to architecture 1200 for storing
information
and instructions. Architecture 1200 can also be coupled to a second I/O bus
1206 via
an I/O interface 1207. A plurality of I/O devices may be coupled to I/O bus
1206,
including a display device 1208, an input device (e.g., an alphanumeric input
device
1209 and/or a cursor control device 1210).
The communication device 1211 allows for access to other computers
(e.g., servers or clients) via a network. The communication device 1211 may
include
one or more modems, network interface cards, wireless network interfaces or
other
interface devices, such as those used for coupling to Ethernet, token ring, or
other types
of networks.
FIG. 1 illustrates system architecture of an exemplary network incident
identification and analysis system 1300 deployed in an enterprise network,
according to
one embodiment. The system 1300 includes a manager 1310 that is located in a
cloud
server 1325. The cloud server may reside in a public or private cloud. The
system
1300 also includes one or more collectors 1320 that send data to the manager
1310. In
one embodiment, the collector 1320 is a software appliance (virtual or
physical) that is
located "on-premise" (i.e., on the premise of the enterprise network), and not
at a
remote Internet-linked location. In this embodiment, the collector is logic in
a non-
transitory computer readable memory that is executed by a processor to perform
the
actions described herein. In other embodiments, the collector is a combination
of
hardware and software. The collector receives live packets captured directly
from
physical and/or virtual network elements. The collector also receives data
(e.g.,
topology, statistics, user information, and the like) from other enterprise
systems
including identity management systems, network element controllers (e.g.,
controllers,
network management systems), and the like. In some embodiments, the collectors
1320
transmit the data obtained from physical and/or virtual network elements of
the
enterprise network directly to an analytics system 1330 of the system 1300
(located
remotely) for further examination. In some embodiments, the analytics system
1330 is
a sub-component of the manager 1310 and is located in the cloud server
1325. In some
embodiments, the manager 1310 in combination with the analytics system 1330
runs
the algorithms for identifying, analyzing, and remediating network incidents
from
cross-company real-time network data.
Identifying, Analyzing, and Remediating Network Incidents From Cross-Company
Real-Time Network Data:
Referring now to FIGS. 14-16, the network incident identification and
analysis system disclosed herein is directed towards (1) receiving real-time
network
data (at 1410) and executing performance calculations (at 1420), (2) computing
client
and/or infrastructure incidents from the real-time network data (at 1430), (3)
performing
outlier detection of the incidents (at 1440), (4) finding the root causes of
the incident
pattern (at 1450), and (5) automatically determining (and implementing) a proper
remediation of the identified incident pattern (at 1460). Some embodiments of
the
network incident identification and analysis system also include aggregating
those
network incidents for a group of clients/infrastructure (at 1470), mapping the
network
incidents to group root causes (at 1480), and automatically determining and
implementing the proper group remediation (at 1490). Additionally, some
embodiments of the network incident identification and analysis system include
detecting outlier subgroups within those network incidents (at 1485), finding
deviations
within those network incidents, prioritizing the deviations, and comparing the
network
incident across the current company and other "similar" companies. Moreover,
further
time-based analysis of the network incident may be performed using embodiments
of
the network incident identification and analysis system (at 1495).
The network incident identification and analysis system examines
multiple sources of real-time input network data, including: (1) metadata
gleaned from
deep packet analysis of real network traffic; (2) data from existing network
elements,
such as wireless controllers, access points, and the like; and (3) data from
other on-premise systems, such as Microsoft Lync servers, and the like.
This combined real-time data is received into the network incident
identification and analysis system, where the data is time-aligned, such that
for every
time T (for individual clients), there is a record of client data including:
(1) Device/OS
information; (2) Wireless metrics, such as SNR, packet loss/retransmits,
connected
Access Point, Channel Utilization at the AP, and the like; (3) L2-L4
information such as
VLAN, receiver/transmitter bytes/packets, DSCP, TCP latency/retransmits, and
the
like; (4) Network services related L7 information such as DHCP, DNS, and
RADIUS
protocol info and response times; (5) L5-L7 information such as SSL
Certificate,
hostnames, HTTP response times; and (6) Application ID for detected
applications such
as Skype, SaaS applications, and the like.
In another aspect of the network incident system, the data is time-
aligned, such that for every time T (for access points and wireless LAN
controllers),
there is a record of wireless information such as: (1) Detected noise and
interference
levels on each channel; (2) Information about neighboring APs and topology;
(3)
CPU/Memory utilization, reboots, and the like.
In some embodiments of the network incident identification and analysis
system, where data is pulled from other on-premise systems, there could be
additional
client and/or infrastructure information for every time T, including: (1)
UC systems
(e.g., Microsoft Lync) call quality records including client MOS scores,
jitter, packet
loss, abnormal call terminations, and the like; (2) CPU/Memory utilization
information
from DHCP, DNS, RADIUS servers, as well as other network elements; and (3)
NetFlow/sFlow/SNMP records from routers, switches, and the like. NetFlow is a
network protocol for collecting IP traffic information and monitoring network
traffic to
analyze flow data and create a picture of network traffic flow and volume.
sFlow is a
network traffic measurement system technology that is embedded in a network
device
and provides continuous statistics on any protocol (L2, L3, L4, and up to L7),
thereby
enabling traffic throughout a network to be accurately characterized.
Referring now to pattern analysis of individual network incidents, in the
network incident identification and analysis system, a network incident is
defined as a
client, infrastructure network, or application issue that may occur in the
environment.
Examples of these kinds of issues, affecting an individual client or server or
network
element include: (1) "Client had poor Wi-Fi performance," (2) "Client could
not
connect due to RADIUS issues," (3) "Access point had more 5GHz Capable Clients
connecting on 2.4GHz," (4) "RADIUS server had outages," (5) "DNS Infrastructure
issues affected client," (6) "Client had service interruptions due to
excessive SSID
transitions," (7) "Client had poor Internet performance," and (8) "Client did
a
TCP/UDP Port Scan."
Notably, when analyzing these network incidents, the network incidents
often may be divided into "client type" incidents (i.e., incidents that
involve the
experience of a client), and "infrastructure type" incidents (i.e., incidents
that involve
the experience of an infrastructure element). Additionally, some network
incidents may
relate to a generic issue that is performance related (e.g., client could not
connect to the
network) or even security related (e.g., detected abnormal client behavior,
such as a port
scan).
For each of the above-described network incidents, the network incident
identification and analysis system performs a mathematical analysis that
involves
inspecting for a pattern of parameters that persist over a period of time. In
one example
of a performance metric that persists over time, this type of incident may be
detected by
first evaluating a time series performance metric X(t) that would be evaluated
on a
"time instant by time instant" basis, based on other metrics present at that
time instant.
Next, the incident is detected over a longer period of time T when X(t) is
less than some threshold q for a certain proportion of T. In one embodiment,
the period
the period
of time is 10 minutes. In other embodiments, the period of time may be 1, 2,
5, 20, or
30 minutes.
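The persistence test described above can be sketched as follows. The function and parameter names are hypothetical; samples stands for the values of X(t) at the time instants within the period T, threshold corresponds to q, and min_fraction is the required proportion of T.

```python
def incident_detected(samples, threshold, min_fraction):
    """Return True if X(t) < threshold for at least min_fraction of the samples."""
    if not samples:
        return False  # no data in the period, so no incident is declared
    below = sum(1 for x in samples if x < threshold)
    return below / len(samples) >= min_fraction
```

For example, with ten one-minute samples of a Wi-Fi performance score, a threshold of 0.5, and min_fraction of 0.6, an incident is flagged when at least six of the ten samples fall below 0.5.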

In a first example, the network incident identified is that the "Client had
poor Wi-Fi performance." In this scenario, X(t) represents Wi-Fi performance
based on
other metrics like SNR, L2 Retransmits/Packet loss, amount of data
transmitted, etc. In
a second example, the network incident identified is that the "Access point
had more
5GHz Capable Clients connecting on 2.4GHz." In this scenario, X(t) is an
indicator of whether the number of 5GHz capable clients connecting to the
2.4GHz access point radio exceeds, by Q percent, the number of 5GHz capable
clients connecting to the 5GHz AP radio.
In another implementation, the network incident identification and
analysis system also performs a mathematical analysis that involves inspecting
for a
pattern in a sequence of metrics/events that manifest over time. This type of
network
incident may be detected by looking for a sequence of events {E} that manifest
over a
time T. In a first example, the network incident identified is that the
"Clients have
service interruptions due to excessive SSID transitions." In this scenario {E}
would be
the sequence of at least M "back and forth" SSID transitions for a particular
client
within a period of time T. In this example, an "incident occurrence at time t"
is defined as a network incident that was detected over the time period
between t-T and t.
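The event-sequence test for excessive SSID transitions can be sketched as follows. The description does not define a "back and forth" transition precisely; this sketch assumes it means a change of SSID that returns the client to the SSID it was on before its previous change (A, then B, then A again). All names are hypothetical.

```python
def excessive_ssid_transitions(events, t, window, min_transitions):
    """Return True if a client made at least min_transitions back-and-forth
    SSID transitions in the period between t - window and t.

    events: iterable of (timestamp, ssid) association events for one client.
    """
    recent = [ssid for ts, ssid in sorted(events) if t - window <= ts <= t]
    back_and_forth = sum(
        1
        for i in range(2, len(recent))
        # The SSID changed, and the new SSID is the one used two events ago.
        if recent[i] != recent[i - 1] and recent[i] == recent[i - 2]
    )
    return back_and_forth >= min_transitions
```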
Additionally, for any network incident there is a "group" version of the
"individual" incident (which is described in greater detail below). The
"group" version
of an incident refers to the number of affected clients/infrastructure out of
total number
of clients/infrastructure. For example, "100 out of 1000 clients had poor Wi-Fi
performance" would imply that out of 1000 clients that were using the Wi-Fi
network,
100 of them had poor Wi-Fi performance.
Referring now to root cause analysis of individual network incidents, once a network
incident is detected for an individual client or infrastructure element by the
network
incident system and a pattern has been determined, the root cause is then
established by
measuring a vector of "symptoms" that manifested at the same time instances as
when
the incident was occurring. These "symptoms" are specific relevant factors
that explain
causation of why that incident was occurring.
In one implementation, {K} is the set of time instances (out of the total
period T) when the individual client or infrastructure element experienced the
incident.
Continuing in this implementation, {S} is the set of "symptoms" that are
relevant to the
particular incident in question. For each symptom s_i in {S}, the network
incident
identification and analysis system evaluates the correlation of s_i based on
the percent
of time instances K where the symptom manifested itself, as well as the
corresponding
values of the symptom at those instances. Additionally, {CS} is defined as the
vector
of correlation values for each symptom.
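A simplified sketch of computing the {CS} vector follows. It covers only the first ingredient named above, the fraction of incident time instances {K} at which each symptom manifested; weighting by the symptom values at those instances is omitted. The function name and the input layout are assumptions.

```python
def symptom_correlations(incident_instants, symptom_instants):
    """Return a dict mapping each symptom to the fraction of incident
    time instances (the set K) at which that symptom manifested.

    incident_instants: iterable of time instants when the incident occurred.
    symptom_instants: dict mapping a symptom name to the instants at which
        the symptom was observed.
    """
    k = set(incident_instants)
    if not k:
        return {name: 0.0 for name in symptom_instants}
    return {
        name: len(k & set(instants)) / len(k)
        for name, instants in symptom_instants.items()
    }
```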
In the next portion of the root cause analysis, the {CS} vector is mapped
to a set of potential root causes: the correlation values are first
"thresholded," and then translated via a remediation database. An example of
this
process is provided below for a network incident in which the "Client X had
poor
Internet performance." In this example, the network incident identification
and analysis
system measures poor Internet performance by noting that HTTP response time
was
greater than "X" for over "Y" percent of the time "T" (e.g., see incident
calculation).
The system tracks measured symptoms {S} during time instances {K}
when the client was experiencing poor Internet performance:
(1) HTTP symptoms: HTTP response time greater than A, and the like.
(2) TCP symptoms: TCP latency greater than B, TCP retransmits greater
than C, TCP window size less than D, and the like.
(3) DNS symptoms: DNS Latency greater than E, DNS drops greater
than F, Unauthorized DNS Server used, and the like.
(4) Wi-Fi symptoms: SNR less than G, L2 packet loss/ retransmits
greater than H, Client roaming event, AP Noise greater than I, AP Channel
Utilization greater than J, and the like.
In this scenario, an example root cause result identified by the network
incident system is "poor Internet performance caused by Noise at Access
Point." This
root cause result was based on the factors (1) HTTP response time greater than
A, (2)
SNR less than G, and (3) AP Noise greater than I, which each manifested
themselves
over Z percent of the {K} bad time samples. Additionally, the amount that a
symptom
deviates from the normal/good value is also taken into account when deciding
its
contribution to a root cause result. In this manner, the set of potential root
causes is
aggregated and presented as the overall potential root causes for the specific
Incident
occurrence.
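The thresholding and translation steps can be sketched as a lookup against a small table. The table format, the symptom names, and the second table entry are hypothetical; the first entry mirrors the access point noise example above. The amount by which a symptom deviates from its normal value is not modeled here.

```python
# Hypothetical remediation database: (required symptoms, root cause).
REMEDIATION_DB = [
    (("http_response_time_high", "snr_low", "ap_noise_high"),
     "poor Internet performance caused by Noise at Access Point"),
    (("http_response_time_high", "dns_latency_high"),
     "poor Internet performance caused by slow DNS"),  # illustrative entry only
]


def map_root_causes(correlations, threshold, remediation_db=REMEDIATION_DB):
    """Threshold the correlation vector, then return every root cause whose
    required symptoms are all active."""
    active = {name for name, value in correlations.items() if value >= threshold}
    return [cause for required, cause in remediation_db if set(required) <= active]
```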
Referring now to group incident computation and group root cause
mapping, some embodiments of the network incident identification and analysis
system
also include the ability to: aggregate network incidents for a group of
clients/infrastructure,
map the network incidents to group root causes, and automatically determine
and
implement the proper remediation. Regarding the grouping aspect of the system,
in one
implementation for a given time period T (once all of the individual client or
infrastructure network incidents are detected), all of the individual client
or infrastructure network incidents that have occurred are categorized into a
"group incident occurrence." An example of a "group incident occurrence" would be
"Between 10AM-11AM on 1/1/2016, 150 out of 1000 clients had poor Internet
performance." This group incident occurrence indicates that during the time T
(i.e.,
10AM-11AM on 1/1/2016), 150 clients had the individual incident of having bad
Internet performance, whereas 850 other clients that were also using Internet
did not
have bad Internet performance.
Next, regarding the outlier analysis aspect of the system, the group
incident occurrence is analyzed for the presence of any "outlying subgroups."
An
outlying subgroup is determined by first partitioning the total number of
clients
according to some grouping (e.g., by AP-Group, AP, generic Location, device
type
and/or OS, radio channel, and the like). Next, particular sub-partitions are
identified as
an outlying subgroup if the percent of affected clients is "much higher" than
the overall
percent of affected clients and infrastructure. In this regard, the threshold
values for
"much higher" may be defined statistically, by assuming a binomial
distribution for the
number of clients being "affected" within any subgroup, and looking for the
probability
of a subgroup having a particular number of affected clients higher than X.
In one embodiment of the network incident identification and analysis
system, an example of an outlying network incident subgroup is a particular
AP-Group X, during the same time T, having 120 out of 150 clients experience
poor Internet
Internet
performance. This outlying network incident subgroup implies that the percent
of
affected clients in AP group X is 80 percent, even though the overall percent
of affected
clients is only 15 percent. According to a binomial distribution, the
probability of this
occurring is infinitesimally small, and thus, AP group X would be flagged as
an
outlying subgroup within this incident.
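The binomial test for an outlying subgroup can be sketched with the Python standard library. The sketch assumes that each client in the subgroup is affected independently with probability equal to the overall affected rate, and it uses a hypothetical significance level alpha, which the description does not specify. The tail probability is summed in log space so that large binomial coefficients do not overflow.

```python
import math


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1 - p)
        )
        total += math.exp(log_term)
    return total


def is_outlying_subgroup(affected, total, overall_rate, alpha=1e-6):
    """Flag the subgroup if seeing this many affected clients by chance
    is less likely than alpha (an assumed cut-off)."""
    return binomial_tail(total, affected, overall_rate) < alpha
```

For the example above, is_outlying_subgroup(120, 150, 0.15) flags AP group X, whereas a subgroup with 25 of 150 clients affected would not be flagged at the same overall rate.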
Regarding the group root cause analysis aspect of the system, a
clustering analysis is implemented that uses correlation vectors of the
individual
affected clients (i.e., {CS}_i for each client "i"). Accordingly, the network
incident
identification and analysis system can determine a distribution of systemic
root causes
that affects different groupings of the overall set of affected clients, by
first clustering
these affected clients and mapping them to a root cause together. Notably, the
examples discussed above describe a client group incident, although the same
analysis
would be performed for a group network incident with respect to infrastructure

elements (i.e., for infrastructure network incidents).
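As a deliberately simplified illustration of the grouping step, the sketch below assigns each affected client to the symptom with which it is most strongly correlated and reports the share of clients per symptom. This nearest-symptom grouping stands in for a real clustering algorithm, which the description does not specify; the function name and the input layout are assumptions.

```python
from collections import Counter


def root_cause_distribution(correlations: dict) -> dict:
    """correlations maps client id -> {symptom: correlation score}.

    Each client is assigned to its strongest symptom (a stand-in for a real
    clustering step); the result is the share of clients per symptom.
    """
    counts = Counter(max(vec, key=vec.get) for vec in correlations.values() if vec)
    total = sum(counts.values())
    return {cause: n / total for cause, n in counts.items()} if total else {}


clients = {
    "a": {"AP Noise": 0.9, "AP Roam": 0.1},
    "b": {"AP Noise": 0.7, "AP Roam": 0.2},
    "c": {"AP Noise": 0.1, "AP Roam": 0.8},
    "d": {"AP Noise": 0.6, "AP Roam": 0.3},
}
print(root_cause_distribution(clients))
```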
Regarding the prioritization aspect of the system, after a group incident
is detected, the network incident identification and analysis system then
prioritizes (in
terms of importance) the network incident. The prioritization is performed
based on
many factors that may include: (1) percent of clients affected by the
incident; (2)
presence of important users as part of this incident; (3) deviation from the
'intra-
company baseline' of percent of clients normally affected by this incident;
(4) deviation
from the 'inter-company baseline' of percent of clients normally affected by
this
incident; and (5) the types of devices that are part of this incident.
The prioritization is performed by the network incident identification
and analysis system by using high-level metrics based around the percent and
type of
affected clients, as well as deviations of this percent from the baseline for
a given
incident. By contrast, traditional solutions have been known to prioritize via
thresholds
and deviations from individual raw metrics. For any incident that is above a
certain
priority level, an alert can immediately be sent to the user.
To determine the intra-company baseline deviation for an incident
occurrence, the following method is used by the network incident
identification and analysis system. In one exemplary embodiment, the current
incident occurrence is at time t (i.e., occurred between t-T and t), and
involves X_t out of Y_t clients being affected. The
baseline to be compared against involves examining the same incident type for
a time period between t-T1 -> t-T2, where T1 is on the order of days or weeks
(e.g., 2 weeks), and T2 would be a sufficient gap so that incident history is
considered vs. the current incident occurrence itself (e.g., T2 > 2*T). Next,
for each time i in {t-T1 -> t-T2}, let X_i out of Y_i clients be affected by
this incident occurrence.
The baseline average of the percent of clients having poor performance
when approximately Y_t clients are present is equal to the weighted average of
the (X_i / Y_i) values, weighted inversely proportional to |Y_i - Y_t|.
Similarly, the baseline standard deviation is a weighted standard deviation
according to the same weights. In this way, deviations from the intra-company
baseline are deemed small or large based on the number of baseline standard
deviations the current incident occurrence's percent of affected clients is
from the baseline average. Notably, the intra-company baseline may
be calculated for infrastructure based incidents in the exact same manner.
Specifically,
higher weight may be given to the historical incident occurrences when the
number of
active infrastructure elements (with respect to the specific incident type)
are
approximately the same as the current incident occurrence.
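The weighted baseline described above may be sketched as follows, for illustration only. The smoothing term `eps` is an assumption added so that a historical occurrence with exactly the same number of clients as the current one does not produce a division by zero; the function names are likewise hypothetical.

```python
import math


def intra_company_baseline(history, y_t, eps=1.0):
    """history: list of (x_i, y_i) pairs from the baseline period.

    Returns the weighted mean and weighted standard deviation of x_i / y_i,
    with weights inversely proportional to |y_i - y_t| (eps is an assumed
    smoothing term, not part of the description).
    """
    weights = [1.0 / (abs(y_i - y_t) + eps) for _, y_i in history]
    rates = [x_i / y_i for x_i, y_i in history]
    total = sum(weights)
    mean = sum(w * r for w, r in zip(weights, rates)) / total
    var = sum(w * (r - mean) ** 2 for w, r in zip(weights, rates)) / total
    return mean, math.sqrt(var)


def baseline_deviation(x_t, y_t, history, eps=1.0):
    """Number of baseline standard deviations the current occurrence is from
    the baseline average."""
    mean, std = intra_company_baseline(history, y_t, eps)
    diff = x_t / y_t - mean
    if std == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / std
```

For example, with a history of `[(10, 100), (30, 100)]` and a current occurrence of 40 out of 100 clients, the current occurrence lies about two baseline standard deviations above the baseline average.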
To determine the inter-company baseline deviation for an incident
occurrence, the following method is used by the network incident
identification and analysis system. In one exemplary embodiment, the first
determination is the set of "similar companies" {C} to consider as part of
the baseline. This may be determined
using the following parameters: (1) Vertical (e.g., higher education,
hospital, high-tech
enterprise, and the like); (2) Size of environment; (3) Location (i.e., urban
vs. rural);
and (4) Device mix.
Once the set of similar companies is determined by the network incident
identification and analysis system, the inter-company baseline can be
calculated in a
similar fashion to the intra-company baseline. Using a historical time period
t-T3 -> t-T2, for each time i in {t-T3 -> t-T2}, let X_{c,i} out of Y_{c,i}
clients have had poor Wi-Fi performance for company c. The equivalent
intra-company baseline algorithm can be used with all of these data points
rather than only the ones from the same company.
In another aspect, the network incident identification and analysis system
also determines cross-company factor comparisons for network incidents. The
factors
help identify the differences in the cross-company environment that lead to
differences
in the baseline performance. In one example of a cross-company factor
comparison
using the network incident identification and analysis system, a set of
network incidents
are identified over a time period t1 -> t2. In this example, the reference
number of
clients/infrastructure is fixed relevant to that incident at either an
absolute number N or
percentile P. For any incident type, a static list of "influencing factors" is
defined. In a
first example network incident, it was determined that "Clients had poor Wi-Fi
performance." The factors in this first example network incident include: (1)
AP-
Models; (2) Controller software version; (3) Channel widths; and (4) AP
degree.
In a second example network incident, it was determined that "Client
could not connect due to RADIUS issues." The factors in this second example
network
incident include: (1) Controller software version; (2) Controller RADIUS
configuration;
(3) RADIUS protocols in use; and (4) RADIUS server type.
For each network incident type, over the time period between t1 -> t2,
the network incident identification and analysis system performs a cross-
company
analysis among "similar companies" to compare the factors of this company to
the best
performing and worst performing companies, from a client perspective for a
client
incident type, and from an infrastructure perspective for an infrastructure
incident type.
In this process, a client incident type and an absolute reference number
of clients N are assumed and the following steps are performed: (1) For each
company,
find the intra-company baseline for that network incident with respect to
total number
of clients equal to N. (2) For each company, for the R most relevant network
incidents
that contributed to the intra-company baseline, aggregate the client
distribution of the
comparison factors (e.g., percentage of clients that connected to AP-model X,
controller
software version number, and the like). (3) Rank the intra-company baselines,
and pick
out the top X and bottom Y companies; for each of these companies show the
client
distributions of the comparative factors.
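The three steps above may be sketched as follows. This is an illustrative example only: it assumes the per-company intra-company baselines have already been computed, treats a lower baseline percent of affected clients as better performance, and uses hypothetical function and parameter names.

```python
from collections import Counter


def factor_distribution(client_factor_values):
    """Share of clients per factor value (e.g., per AP model)."""
    counts = Counter(client_factor_values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def compare_companies(baselines, client_factors, top_x=1, bottom_y=1):
    """baselines: company -> baseline fraction of affected clients.
    client_factors: company -> list with one factor value per client.
    """
    # Assumption: a lower baseline fraction of affected clients ranks as better.
    ranked = sorted(baselines, key=baselines.get)
    best = ranked[:top_x]
    worst = ranked[len(ranked) - bottom_y:]
    return {
        "best": {c: factor_distribution(client_factors[c]) for c in best},
        "worst": {c: factor_distribution(client_factors[c]) for c in worst},
    }


baselines = {"A": 0.05, "B": 0.20, "C": 0.10}
client_factors = {"A": ["AP-1", "AP-1"], "B": ["AP-2", "AP-1"], "C": ["AP-2"]}
result = compare_companies(baselines, client_factors)
```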
Notably, a similar analysis can be performed for a reference percentile P,
except that the set of relevant network incidents for the baseline of each
company is
simply the set of network incidents whose total number of clients or
infrastructure is
close to the percentile P over the time period between t1 -> t2.
In another aspect, the network incident identification and analysis system
also performs long run root cause analysis. In one example of a long run root
cause
analysis using the network incident identification and analysis system, a set
of network
incidents are identified over a time period t1 -> t2. In this example, the
reference number of clients/infrastructure is fixed relevant to that incident
at either an absolute number N or percentile P. To determine the long run
root cause, the relevant incident occurrences are selected (as described
above for N or P), and for each
network
incident the client/infrastructure potential root cause distributions are
aggregated to
determine the overall root cause distribution.
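As an illustrative sketch, the aggregation of per-incident root cause distributions may be expressed as a sum of counts followed by normalization. The input layout (one mapping of root cause to number of affected clients per incident occurrence) and the function name are assumptions.

```python
from collections import Counter


def long_run_root_causes(incidents):
    """incidents: one dict per relevant incident occurrence, mapping a
    potential root cause to the number of clients attributed to it."""
    totals = Counter()
    for distribution in incidents:
        totals.update(distribution)  # adds the counts per root cause
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {cause: n / grand_total for cause, n in totals.items()}


overall = long_run_root_causes([
    {"AP Noise": 30, "DNS": 10},
    {"AP Noise": 10, "DNS": 30, "RADIUS": 20},
])
```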
In still another aspect, the network incident identification and analysis
system also performs outlying subgroup analysis over a longer time period.
Again, the
reference number of clients/infrastructure is fixed relevant to that incident
at either an
absolute number N or percentile P. In performing the outlying subgroup
analysis, the
relevant incident occurrences are selected (as described above for N or P).
Next, the network incident identification and analysis system identifies
which subgroups manifest themselves the most as repeated outliers. This may
be performed by summing, for every relevant incident occurrence where
subgroup G was an outlying subgroup, the value X_g/Y_g, where this value
represents the percent of affected clients within that
within that
subgroup. The outlying subgroup with the maximum value for this metric is the
outlying subgroup determined to be the most outlying over time.
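The summation described above may be sketched as follows, for illustration only. The input layout and names are assumptions: each incident occurrence is represented by a mapping from every subgroup flagged as outlying in that occurrence to its (X_g, Y_g) pair.

```python
from collections import defaultdict


def most_outlying_subgroup(occurrences):
    """occurrences: one dict per relevant incident occurrence, mapping each
    outlying subgroup to (x_g, y_g) = (affected clients, total clients)."""
    scores = defaultdict(float)
    for outliers in occurrences:
        for subgroup, (x_g, y_g) in outliers.items():
            scores[subgroup] += x_g / y_g
    winner = max(scores, key=scores.get) if scores else None
    return winner, dict(scores)


winner, scores = most_outlying_subgroup([
    {"AP-Group X": (120, 150), "Floor 2": (30, 40)},
    {"AP-Group X": (90, 100)},
])
```

Because the metric is a sum rather than an average, a subgroup that is flagged in many occurrences accumulates a higher score than one flagged only once with a similar percent of affected clients.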
In yet another aspect, the network incident identification and analysis
system also performs causal analysis. In the performance of causal analysis,
the first
step is to compute the intra-company baseline. Next, the intra-company
baseline is
overlaid with detected changes in related factors, general configuration, and
the like.
Additionally, the network incident identification and analysis system can
overlay
manually annotated relevant important events. This overlaid baseline graph is
analyzed
to automatically detect if a change in the intra-company baseline coincides
with any
automatic or manually annotated changes. If a change in the intra-company
baseline
does coincide with any automatic or manually annotated changes, then one or
more
causalities between an event and the baseline may be determined.
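One simple way to check for such a coincidence is sketched below, for illustration only. The change detector (a jump of at least `min_shift` between consecutive baseline values) and the matching window are assumptions chosen for brevity; the description does not prescribe a particular change-detection method.

```python
def coinciding_events(baseline, events, min_shift=0.05, window=1):
    """baseline: list of (time, value) points in time order.
    events: list of (time, label) for detected or manually annotated changes.

    Returns (time, label) pairs where a baseline shift of at least min_shift
    occurs within `window` time units of an event (both values are assumed
    tuning parameters).
    """
    matches = []
    for (_, v_prev), (t_cur, v_cur) in zip(baseline, baseline[1:]):
        if abs(v_cur - v_prev) >= min_shift:
            for t_event, label in events:
                if abs(t_event - t_cur) <= window:
                    matches.append((t_cur, label))
    return matches


baseline = [(1, 0.10), (2, 0.11), (3, 0.25), (4, 0.26)]
events = [(3, "controller upgrade"), (10, "new AP model")]
print(coinciding_events(baseline, events))
```

A coincidence found this way is only a candidate causality and would still need to be confirmed, consistent with the wording "may be determined" above.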
FIG. 17A illustrates a computer screenshot displaying a client network
incident for a network device. Specifically, the computer screenshot shows the
troubleshooting of a client device that could not connect due to DNS issues.
Potential
root causes were identified as well as potential remediation steps. Potential
remediation
steps may include network configuration changes and/or network upgrades to
increase
capacity.
FIG. 17B illustrates a computer screenshot displaying a client network
incident for a wireless network device. Specifically, the computer screenshot
shows the
troubleshooting of a client device that experiences poor Skype performance.
Potential
root causes were identified as well as analysis of the symptoms including: TCP
latency,
TCP retransmit, Wi-Fi performance, SNR, L2 Rx Retransmit, Radio Channel Busy,
AP
Noise, Neighbor Interference, and AP Roam.
FIG. 17C illustrates a computer screenshot displaying a group
infrastructure network incident. Specifically, the computer screenshot shows
the
troubleshooting of a DNS infrastructure issue that is affecting clients. An
incident
summary is displayed as well as a baseline comparison.
FIG. 17D illustrates a computer screenshot displaying a group network
incident for wireless network devices. Specifically, the computer screenshot
shows the
troubleshooting of clients having poor Wi-Fi performance. An incident summary
is
displayed as well as an Outlying Subgroups analysis.
FIG. 17E illustrates a computer screenshot displaying a group network
incident for wireless network devices, as well as a root cause analysis. An
analysis of
the symptoms was performed, including: L2 Rx Retransmit, AP Noise, Radio
Channel
Busy, and AP Roam. Grouping by client properties was also performed.
Computing Client Congestion Metrics From Cross-Company Real-Time Network
Data:
Referring now to FIG. 18, other embodiments of the systems and
methods disclosed herein are directed towards computing client congestion
metrics
from real-time network data. Additionally, some embodiments of the disclosed
network congestion analysis and management system and method include finding
outlying subgroups for client congestion, comparing the client congestion
metrics
across the current company and other "similar" companies. Moreover, further
time-
based analysis of the congestion metrics may be performed using embodiments of
the
system.
As shown in FIG. 18, the congestion analysis and management system
and method may be used to perform real-time calculations of a single
congestion. At
step 1810, the system computes whether a time is a "peak usage time" for a
"service."
At Peak Usage time (step 1820), the system computes a client-centric view of
congestion for that service (e.g., for RADIUS, the percent of clients that see
a latency
greater than X; for Wi-Fi, the percent of clients that see channel utilization
greater than
X). At step 1830, a Partial Incident Pipeline is shown.
Also shown in FIG. 18, the congestion analysis and management system
and method may be used to perform advanced congestion analysis over time.
At step 1840, the system identifies many Congestion KPIs (Key Performance
Indicators) over time. At step 1850, the system determines baseline
congestion from numerical analysis. Next, at step 1860, the system performs
persistent location and outlier analysis by analyzing outlying subgroup
incidences. At step 1870, the system aggregates statistics of this incidence
and compares factors to other "similar" environments. At step 1880, the
system computes the relationship between the congestion and the incidences.
Then at step 1890, the system predicts future incidences based on current
congestion KPI values.
As described above, the disclosed network congestion analysis and
management system and method examine multiple sources of real-time input
network
data, including: (1) metadata gleaned from deep packet analysis of real
network traffic;
(2) data from existing network elements, such as wireless controllers, access
points, and the like; and (3) data from other on-premise systems, such as
Microsoft Lync servers, and the like.
This combined real-time data is received into the network congestion
analysis and management system, where the data is time-aligned, such that for
every
time T (for individual clients), there is a record of client data including:
(1) Device/OS
information; (2) Wireless metrics, such as SNR, packet loss/retransmits,
connected
Access Point, Channel Utilization at the AP, and the like; (3) L2-L4
information such as
VLAN, receiver/transmitter bytes/packets, DSCP, TCP latency/retransmits, and
the
like; (4) Network services related L7 information such as DHCP, DNS, and
RADIUS
protocol info and response times; (5) L5-L7 information such as SSL
Certificate,
hostnames, HTTP response times; and (6) Application ID for detected
applications such
as Skype, SaaS applications, and the like.
In another aspect of the network congestion analysis and management
system, the data is time-aligned, such that for every time T (for access
points and
wireless LAN controllers), there is a record of wireless information such as:
(1)
Detected noise and interference levels on each channel; (2) Information about
neighboring APs and topology; (3) CPU/Memory utilization, reboots, and the
like.
In some embodiments of the network congestion analysis and
management system, where data is pulled from other on-premise systems, there
could
be additional client and/or infrastructure information for every time T,
including: (1)
UC systems (e.g., Microsoft Lync) call quality records including client MOS
scores,
jitter, packet loss, abnormal call terminations, and the like; (2) CPU/Memory
utilization
information from DHCP, DNS, RADIUS servers, as well as other network elements;
and (3) NetFlow/sFlow/SNMP records from routers, switches, and the like.
NetFlow is
a network protocol for collecting IP traffic information and monitoring
network traffic
to analyze flow data and create a picture of network traffic flow and volume.
sFlow is a
network traffic measurement system technology that is embedded in a network
device
and provides continuous statistics on any protocol (L2, L3, L4, and up to L7),
thereby
enabling traffic throughout a network to be accurately characterized.
In some embodiments of the network congestion analysis and management
system and method disclosed here, a client congestion metric for a given
service X is defined as: "At 'peak usage' of service X, what percentage of
clients face congestion that is due to the infrastructure?" In this scenario,
the Peak Usage of a service over a time period between T1 and T2 is defined
below. First, the overall time period [T1, T2] is sliced into even periods
of time length T. An example would be [T1, T2] representing a week, and T
representing 1 hour.
Continuing, in this scenario, the "usage" during a particular time is the
number of clients that are accessing that service during the time T. For
example, for
DHCP, the usage refers to the number of client devices that requested an IP
address
from a DHCP server over that time. In other embodiments, the usage refers to
a more granular usage metric such as bandwidth, number of transactions, or
number of roams (for Internet access, RADIUS, and Wi-Fi, respectively).
In the scenario described above, the term "peak" is defined as "at or
exceeding" a certain usage percentile. In one example, the peak hours of
RADIUS
usage over a two-week period is defined as the hours in which "usage" exceeded
the
90th percentile.
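For illustration only, the selection of peak time slices may be sketched as follows. The nearest-rank percentile definition and the function name are assumptions; other percentile definitions would give slightly different thresholds. "Peak" is taken as "at or exceeding" the threshold, as stated above.

```python
def peak_slots(usage_by_slot: dict, percentile: int = 90) -> list:
    """usage_by_slot: slot identifier -> number of clients using the service
    during that slot of length T. Returns the slots at or above the given
    usage percentile (nearest-rank definition, an assumption)."""
    values = sorted(usage_by_slot.values())
    if not values:
        return []
    # Integer ceiling of percentile * n / 100, at least rank 1.
    rank = max(1, -(-percentile * len(values) // 100))
    threshold = values[rank - 1]
    return [slot for slot, usage in usage_by_slot.items() if usage >= threshold]


# Ten hourly slots with usage 1..10 clients.
usage = {f"h{i}": i for i in range(1, 11)}
print(peak_slots(usage))
```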
Finally, in this scenario, a client is congested for a service if, during the
peak time T, the client faces "congestion" due to the infrastructure. In one
example,
with respect to DHCP, congestion due to infrastructure refers to a client
facing an
average transaction time greater than X for over Y percent of the time. With
respect to
Internet access, this may refer to a client whose underlying Wi-Fi
connectivity is strong,
but the client may suffer from an average HTTP response time that is greater
than Z for
over Y percent of the time. With respect to Wi-Fi, the congestion due to
infrastructure
may refer to a client connecting to an access point radio with channel
utilization greater
than A for over Y percent of the time, or over K number of other clients
connected for
over Y percent of the time.
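The per-client congestion test and the resulting client congestion metric may be sketched as follows, for illustration only. The sample layout (one measurement per sub-interval of the peak period T), the function names, and the numeric values in the usage example are assumptions; `limit` plays the role of X, Z, or A above, and `min_fraction` the role of Y percent.

```python
def is_congested(samples, limit, min_fraction):
    """True if the measurement (e.g., DHCP transaction time or channel
    utilization) exceeds `limit` for more than `min_fraction` of the samples."""
    if not samples:
        return False
    over = sum(1 for s in samples if s > limit)
    return over / len(samples) > min_fraction


def congestion_metric(clients, limit, min_fraction):
    """Percent of clients active during the peak period that are congested."""
    if not clients:
        return 0.0
    congested = sum(is_congested(s, limit, min_fraction) for s in clients.values())
    return 100.0 * congested / len(clients)


clients = {
    "a": [120, 130, 90, 140],   # over the limit in 3 of 4 samples
    "b": [50, 60, 70, 200],     # over the limit in 1 of 4 samples
    "c": [10, 20, 30, 40],
}
metric = congestion_metric(clients, limit=100, min_fraction=0.5)
```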
Referring now to outlying subgroup analysis, in some embodiments of
the disclosed network congestion analysis and management system and method,
all of
the clients facing congestion are grouped together for a given (peak) time
period T. In
one example, this congestion may be described as "Between peak time
10:00am-11:00am on January 1, 2016, 150 out of 1000 clients faced Internet
congestion." This peak congestion occurrence indicates that during the time T
(i.e., 10:00am-11:00am on January 1, 2016), 150 clients faced Internet
congestion, whereas 850 other clients that were also using the Internet did
not experience this congestion.
Continuing, in this scenario the congestion occurrence is analyzed for
the presence of any "outlying subgroups." An outlying subgroup may be
identified as
follows: (1) Partition the total number of clients according to some grouping
(e.g., by
AP-Group, AP, generic Location, device type and/or OS, radio channel, and the
like);
(2) Identify particular sub-partitions as an outlying subgroup if the percent
of affected
clients is "much higher" than the overall percent of affected
clients/infrastructure; and
(3) Define "much higher" statistically by assuming a binomial distribution
for the number of clients being "affected" within any subgroup, and looking
for the probability of a subgroup having a particular number of affected
clients higher than X.
Referring now to an example of an outlying subgroup from the previous
embodiment, a particular AP-Group X, during the same time T, had 120 out of
150
clients with poor Internet performance. This data implies that the percent of
affected
clients in AP group X is 80 percent even though the overall percent of
affected clients is
only 15 percent. According to a binomial distribution, the probability of this
occurring
is infinitesimally small, and thus AP group X would be flagged as an outlying
subgroup
within this congestion occurrence.
Referring now to intra-company and inter-company trend calculation,
since congestion metrics are only calculated during "peak times," the intra-
company
trend is simply the time series of these values.
To determine the inter-company congestion trend, the first step is to
identify the set of "similar companies" {C} to be considered part of the
baseline. The
disclosed network congestion analysis and management systems and methods
perform
this process using parameters that include: Vertical (e.g., higher education,
hospital,
high-tech enterprise, etc.); Size of environment; Location (i.e., urban vs.
rural); Device
mix; and the like. Once the set of similar companies is determined, the inter-
company
baseline may be calculated in a similar fashion to the intra-company baseline.
Referring now to cross-company factor comparison for congestion
metrics, the disclosed network congestion analysis and management systems and
methods calculate a set of congestion metrics for time instances over a time
period t1 -> t2 (per the explanation of the previous section). Continuing,
the disclosed network congestion analysis and management systems and methods
employ a list of
"influencing factors." Multiple example scenarios are provided below. In the
first
example, Wi-Fi Congestion Metric, the influencing factors include: AP-Models;
Controller software version; Channel widths; AP degree, and the like. In the
second
example, RADIUS Congestion Metric, the influencing factors include: Controller
software version; Controller RADIUS configuration; RADIUS protocols in use;
RADIUS server type, and the like.
For each congestion metric, the disclosed network congestion analysis
and management systems and methods employ a cross-company analysis among
"similar companies" (explained in the previous section) over the time period
between t1 -> t2 to compare the factors of this company to the best
performing and worst
performing companies. To perform this analysis, the following steps are
performed for
a specific congestion metric: (1) For each company, find the intra-company
trend for
that congestion metric; (2) For each company, for the R most relevant times
that
contributed to the intra-company trend, aggregate the client distribution of
the
comparison factors (e.g., percent of clients that connected to AP-model X,
controller
software version X, and the like); and (3) Rank the intra-company trends, and
pick out
the top X and bottom Y companies. For each of these companies, the client
distributions of the comparative factors are displayed.
Referring now to long run outlying subgroup analysis and causal
analysis, the disclosed network congestion analysis and management systems and
methods employ a set of congestion metrics for time instances over a time
period t1 -> t2 (per the explanation of the previous section).
With respect to the long run outlying subgroup analysis, the first step
performed by the disclosed systems and methods is to identify the time
instances over
which the peak congestion metric was calculated (as previously discussed).
Next, the
disclosed network congestion analysis and management systems and methods
examine
which subgroups manifest themselves the most as repeated outliers over time.
One way to obtain this information is to simply sum, for every relevant peak
congestion occurrence where subgroup G was an outlying subgroup, the value
X_g/Y_g, where this value represents the percentage of affected clients
within that subgroup.
Accordingly, the
outlying subgroup with the maximum value for this metric is the outlying
subgroup
deemed most outlying over time.
With respect to causal analysis, the first step performed by the disclosed
network congestion analysis and management systems and methods is to compute
the
intra-company baseline. Next, the disclosed systems and methods overlay this
baseline
with detected changes in related factors, general configuration, and the like.
Additionally, the disclosed network congestion analysis and management systems
and methods may overlay annotated relevant important events. This overlaid
baseline graph may be analyzed by the disclosed systems and methods to
automatically detect if a change in the intra-company baseline coincides with
any automatic or manually annotated changes. If such a coincidence has
occurred, then a causality between an event and the baseline may be
determined.
Notably, with respect to all of the previous sections, the above-described
steps do not all need to be performed in any embodiment. In some embodiments,
less
than all of these steps are performed. Additionally, in other embodiments,
some of the
above-described steps may be performed out of order.
Using Real-Time Packet Data in Conjunction With Other Sources to Detect
Wired/Wireless Network Issues:
Referring now to FIG. 19, other embodiments of the systems and
methods disclosed herein are directed towards extracting information from real-
time
network packet data. This information is then combined with data from other
sources
to detect client Connectivity & Roaming Network issues. The disclosed network
information extraction systems and methods examine two key sources of data:
Real
client network traffic and Data from Wireless Controllers.
The "crawler" component of the disclosed network information
extraction system and method examines network traffic and is responsible for
extracting
information from client network traffic. This extracted information from
client network
traffic enables the rest of the disclosed network information extraction
systems and
methods to detect connectivity and roaming issues. Note that the "crawler" can
perform
the role of what is referred to as a "loupe collector", and may collect
additional
information or a different set of information.
With respect to connectivity issues, in some embodiments the crawler
performs the following steps for every client device X in the network. These
steps
include: (1) Detecting when X initiates an "on-the-wire" connectivity event,
by looking
for a RADIUS request, DHCP discover, or DHCP request packet; (2) After
detecting an "on-the-wire" connectivity event, the Crawler waiting Y minutes
for the
Client to accomplish an "Internet event" (e.g., TCP or UDP traffic transmitted
either to
a public IP address, or to a different subnet); (3) At the end of Y minutes
(or sooner in
the case of an Internet event) the crawler recording the following
information: (A)
RADIUS last state (e.g., failed, success, timeout, and the like); (B) DHCP
last state
(e.g., Offer, Acknowledgement, No Acknowledgement, timeout, and the like); (C)
ARP
last state with respect to default gateway (response, timeout, and the like);
and (D) DNS
response (response, timeout, and the like); and (4) The crawler packaging this
information as "connectivity event information" and sending this information
to the
analytics section 1330 of the disclosed network information extraction systems
and
methods.
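The four crawler steps above may be sketched as follows, for illustration only. The class names, the tuple-based packet summaries, and the string labels for protocol states are assumptions introduced for this example; actual packet decoding is outside the scope of the sketch.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Packet summaries that start an "on-the-wire" connectivity event (step 1).
START_PACKETS = {("radius", "request"), ("dhcp", "discover"), ("dhcp", "request")}


@dataclass
class ConnectivityEvent:
    client: str
    started_at: float
    radius_last_state: Optional[str] = None
    dhcp_last_state: Optional[str] = None
    arp_last_state: Optional[str] = None
    dns_last_state: Optional[str] = None
    reached_internet: bool = False


class ConnectivityTracker:
    def __init__(self, wait_seconds):
        self.wait_seconds = wait_seconds  # the "Y minutes" of step 2
        self.open_events = {}

    def observe(self, client, timestamp, protocol, state):
        """Feed one decoded packet summary for a client."""
        event = self.open_events.get(client)
        if event is None:
            if (protocol, state) not in START_PACKETS:
                return
            event = self.open_events[client] = ConnectivityEvent(client, timestamp)
        if protocol == "internet":
            event.reached_internet = True
        elif protocol in ("radius", "dhcp", "arp", "dns"):
            setattr(event, f"{protocol}_last_state", state)  # step 3: last state

    def flush(self, now):
        """Step 4: package events that finished or timed out."""
        done = [c for c, e in self.open_events.items()
                if e.reached_internet or now - e.started_at >= self.wait_seconds]
        return [asdict(self.open_events.pop(c)) for c in done]
```

In this sketch the records returned by `flush` correspond to the "connectivity event information" that would be sent to the analytics section.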
Referring again to FIG. 19, the network information extraction systems
and methods may be used to detect connectivity issues in an aspect of one
embodiment. At step 1910, the Crawler component of the system, via deep
packet inspection, detects a connectivity event for a device on the wire
(e.g., RADIUS or DHCP). Then at step 1920, the Crawler component of the
system gives a client device X minutes to go through the connectivity
process: RAD -> DHCP -> ARP -> DNS -> "WEB." The failure or success of the
process is marked, and the result is stored. Next at step 1930, additional
event correlation (e.g., with wireless association) is performed at the
analytics section 1330 of the disclosed network information extraction system
and method. At step 1940, the resulting "event data" is incorporated into the
INCIDENT/ADVISORY/CAPACITY PIPELINE.
With respect to roaming issues, in some embodiments the crawler
performs the following steps for every wireless client device X in the
network. These
steps include: (1) Detecting when X initiates an "on-the-wire" association
event, by
looking for a RADIUS request with metadata indicating association with a
particular
access point, SSID, and the like; (2) Buffering statistics about traffic such
that, given an "on-the-wire" association event, the crawler determines: (A) If
there was traffic before the event, the total time, T1, between the last
traffic packet before and the first traffic packet after the event; (B) If
there was TCP traffic before the event, the presence of a high percentage of
TCP retransmitted packets, R1, before and/or after the event; and (C) If
there was TCP traffic before the event, the presence of a high percentage of
TCP reset
and synchronize packets, R2, before and/or after the event; and (3) The
crawler
packaging this information as "connectivity event information" and sending
this
information to the analytics section 1330 of the disclosed network information
extraction systems and methods.
Referring once again to FIG. 19, the network information extraction
systems and methods may be used to detect roaming issues in an aspect of one
embodiment. At step 1950, the Crawler component of the system, via deep packet
inspection, detects re-association / roaming by looking for RADIUS events. At
step
1960, the crawler component of the system tracks data transmission pause time
and
TCP performance before and after with respect to retransmissions and the
presence of
RST (reset)/FIN (finish) packets. Next, at step 1970, additional event
correlation (e.g.,
with wireless association) is performed at the analytics section 1330 of the
disclosed
network information extraction system and method. At step 1980, the resulting
"event
data" is incorporated into the INCIDENT/ADVISORY/CAPACITY PIPELINE.
Continuing, with respect to roaming issues without "on-the-wire"
association events (e.g., in systems where "fast roaming" is enabled), for
each wireless
client X in the network, the crawler computes "roaming event information"
based on
short, sliding window periods of time T (i.e., with T being on the order of
seconds or
tens of seconds). The "sliding window roaming event information" blocks are
sent to
the analytics section 1330 of the disclosed network information extraction
systems and
methods.
Notably, in addition to the basic information itself, the crawler also
sends additional metadata (e.g., associated access point, RADIUS user name,
number of
failures/timeouts, and the like) to the analytics section 1330 described above
so that the
disclosed network information extraction systems and methods may disambiguate
a root
cause, as well as determine other detailed evidence regarding the detected
issues.
The analytics section 1330 of the disclosed systems and methods
receives the previously mentioned data from the crawler. With respect to
connectivity
issues, the analytics section 1330 of the disclosed systems and methods may
transform
the connectivity event information into indications of a "connectivity
incident." In one
embodiment, if the last state of the connectivity event information was a
"RADIUS
Reject," then the system may indicate a network incident such as "Client could
not
connect due to RADIUS Issues." This network incident may then be passed
through
the rest of the analytics pipeline described above.
With respect to roaming issues based on "on-the-wire association" events, the
analytics section 1330 of the disclosed systems and methods may also simply
transform the roaming event information into an indication of a "roaming
incident." In one embodiment, if Ti, R1, R2 exceeded pre-defined thresholds,
the system indicates a network incident such as "Client had roaming issues."
This network incident may then be passed through the rest of the analytics
pipeline of the disclosed systems and methods as described above.
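The threshold check in this embodiment might be sketched in Python as follows. The quantities Ti, R1, and R2 are used here only as opaque numeric fields, and the threshold values are invented for illustration. The passage does not say whether all three quantities or any one of them must exceed its threshold, so the sketch exposes that choice as a hypothetical `require_all` parameter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoamingEventInfo:
    """Roaming event information; the fields mirror Ti, R1, R2 from the text."""
    ti: float
    r1: float
    r2: float


# Hypothetical threshold values; the description only says "pre-defined".
DEFAULT_THRESHOLDS = RoamingEventInfo(ti=1.0, r1=3, r2=3)


def roaming_incident(info: RoamingEventInfo,
                     thresholds: RoamingEventInfo = DEFAULT_THRESHOLDS,
                     require_all: bool = True) -> Optional[str]:
    """Return the roaming incident text when the thresholds are exceeded.

    require_all=True treats the condition as "Ti, R1 and R2 all exceed";
    require_all=False treats it as "any of them exceeds". Both readings are
    assumptions, since the source is ambiguous.
    """
    checks = [
        info.ti > thresholds.ti,
        info.r1 > thresholds.r1,
        info.r2 > thresholds.r2,
    ]
    exceeded = all(checks) if require_all else any(checks)
    return "Client had roaming issues." if exceeded else None
```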
With respect to roaming issues without "on-the-wire" association events, the
analytics section 1330 of the disclosed systems and methods employs a separate
source of wireless roaming events for each client X, which can be gleaned from
WLAN controllers, Access Points, and the like. For each wireless roaming event
for client X at time t, the analytics section 1330 of the disclosed systems and
methods checks for a "sliding window roaming event information" that includes
time t, and transforms the resultant information into an indication of a
"roaming incident" described above. This network incident may then be passed
through the rest of the analytics pipeline of the disclosed systems and methods
as described above.
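The matching step described here, in which a roaming event reported by a controller or access point is paired with a sliding window that covers the event time, could look roughly like the Python sketch below. The `Window` record, the tuple format for roaming events, and the shape of the returned incident record are hypothetical. The sketch also emits an incident for every matched event, whereas an actual system would presumably apply further checks on the window contents first.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Window:
    """Assumed stand-in for one 'sliding window roaming event information' block."""
    client: str
    start: float  # window start time, seconds
    end: float    # window end time, seconds


def match_roaming_events(roam_events: List[Tuple[str, float]],
                         windows: List[Window]) -> List[Dict]:
    """Pair each (client, time) roaming event with the first covering window.

    Events with no covering window for the same client are skipped.
    """
    incidents = []
    for client, t in roam_events:
        for w in windows:
            if w.client == client and w.start <= t <= w.end:
                incidents.append({
                    "client": client,
                    "time": t,
                    "window": (w.start, w.end),
                    "incident": "Client had roaming issues.",
                })
                break  # one incident per roaming event
    return incidents


windows = [Window("X", 0.0, 10.0), Window("Y", 0.0, 10.0)]
events = [("X", 4.0), ("X", 20.0), ("Y", 10.0)]
matched = match_roaming_events(events, windows)
```

In this example the event for client X at 20.0 seconds falls outside every window and produces no incident.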
The foregoing description, for purposes of explanation, uses specific
nomenclature and formulas to provide a thorough understanding of the disclosed
embodiments. It should be apparent to those of skill in the art that the
specific details are not required in order to practice the invention. The
embodiments have been chosen and described to best explain the principles of
the disclosed embodiments and their practical application, thereby enabling
others of skill in the art to utilize the disclosed embodiments, and various
embodiments with various modifications as are suited to the particular use
contemplated. Thus, the foregoing disclosure is not intended to be exhaustive
or to limit the invention to the precise forms disclosed, and those of skill in
the art recognize that many modifications and variations are possible in view
of the above teachings.
While various embodiments have been described above, it should be understood
that they have been presented by way of example only, and not limitation. Thus,
the breadth and scope of a disclosed embodiment should not be limited by any of
the above-described exemplary embodiments, but should be defined only in
accordance with the following claims and their equivalents.

Representative Drawing

A single figure which represents the drawing illustrating the invention.

Administrative Status

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer, as well as the definitions for Patent, Administrative Status, Maintenance Fee and Payment History, should be consulted.

Title	Date
Forecasted Issue Date	2024-02-20
(86) PCT Filing Date	2017-04-18
(87) PCT Publication Date	2017-10-26
(85) National Entry	2018-10-10
Examination Requested	2020-04-27
(45) Issued	2024-02-20

Abandonment History

There is no abandonment history.

Maintenance Fee

Last Payment of $277.00 was received on 2024-03-20

Upcoming maintenance fee amounts

Description	Date	Amount
Next Payment if small entity fee	2025-04-22	$100.00
Next Payment if standard fee	2025-04-22	$277.00

Note: If the full payment has not been received on or before the date indicated, a further fee may be required, which may be one of the following:

the reinstatement fee;
the late payment fee; or
additional fee to reverse deemed expiry.

Patent fees are adjusted on the 1st of January every year. The amounts above are the current amounts if received by December 31 of the current year.
Please refer to the CIPO Patent Fees web page to see all current fee amounts.

Payment History

Fee Type	Anniversary Year	Due Date	Amount Paid	Paid Date
Application Fee			$400.00	2018-10-10
Maintenance Fee - Application - New Act	2	2019-04-18	$100.00	2019-04-03
Maintenance Fee - Application - New Act	3	2020-04-20	$100.00	2020-04-01
Request for Examination		2022-04-19	$800.00	2020-04-27
Registration of a document - section 124		2020-06-02	$100.00	2020-06-02
Registration of a document - section 124		2020-06-02	$100.00	2020-06-02
Maintenance Fee - Application - New Act	4	2021-04-19	$100.00	2021-03-22
Maintenance Fee - Application - New Act	5	2022-04-19	$203.59	2022-03-22
Maintenance Fee - Application - New Act	6	2023-04-18	$210.51	2023-03-22
Final Fee			$306.00	2023-12-29
Registration of a document - section 124			$125.00	2024-03-18
Maintenance Fee - Patent - New Act	7	2024-04-18	$277.00	2024-03-20

Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
VMWARE LLC

Past Owners on Record
NYANSA, INC.
VMWARE, INC.

Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.

Documents

List of published and non-published patent-specific documents on the CPD.

If you have any difficulty accessing content, you can call the Client Service Centre at 1-866-997-1936 or send them an e-mail at CIPO Client Service Centre.

Document Description	Date (yyyy-mm-dd)	Number of pages	Size of Image (KB)
Change of Agent	2020-03-11	2	104
Office Letter	2020-04-02	2	214
Office Letter	2020-04-02	1	208
Request for Examination	2020-04-27	1	56
Examiner Requisition	2021-06-02	3	156
Amendment	2021-09-30	16	514
Amendment	2021-10-06	16	614
Claims	2021-09-30	6	183
Claims	2021-10-06	6	220
Examiner Requisition	2022-03-15	4	185
Amendment	2022-07-18	12	437
Amendment	2022-07-13	11	343
Claims	2022-07-13	6	338
Claims	2022-07-18	6	300
Examiner Requisition	2023-01-19	4	180
Amendment	2023-05-15	16	544
Amendment	2023-05-15	17	685
Claims	2023-05-15	6	319
Claims	2023-05-16	6	331
Abstract	2018-10-10	2	91
Claims	2018-10-10	14	587
Drawings	2018-10-10	24	931
Description	2018-10-10	63	3,263
International Search Report	2018-10-10	5	202
Declaration	2018-10-10	4	71
National Entry Request	2018-10-10	4	114
Representative Drawing	2018-10-19	1	20
Cover Page	2018-10-19	1	56
Final Fee / Completion Fee - PCT	2023-12-29	1	65
Representative Drawing	2024-01-23	1	21
Cover Page	2024-01-23	1	60
Electronic Grant Certificate	2024-02-20	1	2,528
