Patent 3140769 Summary

(12) Patent Application:	(11) CA 3140769
(54) English Title:	METHOD AND SYSTEM FOR POSITIONING FAULT ROOT CAUSE OF SERVICE SYSTEM
(54) French Title:	METHODE ET SYSTEME DE LOCALISATION DE LA CAUSE PRINCIPALE D'UNE DEFAILLANCE DANS UN SYSTEME DE SERVICE
Status:	Examination Requested

Bibliographic Data

(51) International Patent Classification (IPC):	H04L 41/0631 (2022.01) G06F 11/00 (2006.01)
(72) Inventors :	ZHAI, XUEPENG (China) BAO, YUXUE (China) GENG, ZHILIANG (China)
(73) Owners :	10353744 CANADA LTD. (Canada)
(71) Applicants :	10353744 CANADA LTD. (Canada)
(74) Agent:	HINTON, JAMES W.
(74) Associate agent:
(45) Issued:
(22) Filed Date:	2021-11-30
(41) Open to Public Inspection:	2022-05-30
Examination requested:	2022-09-16
Availability of licence:	N/A
(25) Language of filing:	English

Patent Cooperation Treaty (PCT):	No

(30) Application Priority Data:

Application No.	Country/Territory	Date
202011376566.6	China	2020-11-30

Abstracts

English Abstract

The present invention discloses a method and system for locating the root cause of a
business system failure. The method comprises: a calculation engine obtains call-chain
messages, samples the call-chain messages according to a sampling ratio, groups and
assembles them to obtain call-chains, obtains the time consumption of each service
interface in a call-chain, and filters out failure service interfaces according to a
time consumption comparison table; the calculation engine obtains network element
indicator messages, calculates a first variation range of each network element
indicator, and filters out failure network element indicators according to a first
variation range threshold value; the calculation engine correlates the failure service
interface with the failure network element indicator according to time and issues a
failure warning, the failure warning comprising the failure service interface and the
failure network element indicator that causes the failure service interface to fail.
By combining call-chain log files and network element indicator data through the above
method, the system quickly locates the root cause of a business system failure.

Claims

Note: Claims are shown in the official language in which they were submitted.

Claims:
1. A method for locating root cause of business system failure comprises:
calculation engine obtains call-chain messages, sampling the call-chain
messages according to
sampling ratio, grouping and assembling to obtain the call-chain, obtaining
time consumption of
each service interface in the call-chain, and filtering out failure service
interface according to time
consumption comparison table;
calculation engine obtains network element indicator messages, calculating
first variation range of
the network element indicator, filtering out failure network element indicator
according to the first
variation range threshold value; and
calculation engine correlates the failure service interface with the failure
network element indicator
according to time and issues failure warning, the failure warning comprises
failure service interface
and failure network element indicator, wherein the failure network element
indicator causes the
failure service interface to fail.
2. The method for locating root cause of business system failure according to
claim 1, wherein, the method
of calculation engine obtains call-chain messages and network element
indicator messages comprises:
network element periodically generates network element indicator messages and
stores in indicator
log file;
generating call-chain messages when service interface deployed by network
element is called and
storing in call-chain unit log file;
network element uses log file collection module to push newly added network
element indicator
Date recue / Date received 2021-11-30

messages of the indicator log file and newly added call-chain messages of the
call-chain unit log
file to distributed publish and subscribe message system; and
calculation engine reads call-chain messages and network element indicator
messages from the
distributed publish and subscribe message system.
3. The method for locating root cause of business system failure according to
claim 1, wherein, the method
of calculation engine filters out the failure service interface comprises:
obtaining and parsing the call-chain messages to get call-chain ID, service
interface name
identification and event corresponding to call-chain message, meanwhile,
obtaining call-chain
entry message;
configuring sampling ratio of the call-chain messages, obtaining call-chain
entry message sample
according to the sampling ratio;
according to call-chain ID and event time to filter out call-chain messages
belonging to the same
group as the call-chain entry message sample, assembling all call-chain
message of the same group
to get call-chain information; and
obtaining time consumption of each service interface in call-chain and
filtering out failure service
interface according to time consumption comparison table.
4. The method for locating root cause of business system failure according to
claim 3, wherein, the method
of parsing the call-chain messages to get call-chain entry message comprises:
setting expiration time of the call-chain messages, filtering and parsing
unexpired call chain
messages according to the expiration time, obtaining call-chain ID, service
interface name
identification and event time of the call-chain messages;
filtering out call-chain entry message in the call-chain messages according to
call-chain ID and
service interface name identification.
5. The method for locating root cause of business system failure according to
claim 3, wherein, the method
of filtering out failure service interface according to time consumption
comparison table comprises:
pre-calculating average time consumption corresponding to service interface,
storing service
interface name identification of the service interface and the average time in
the time consumption
comparison table with one-to-one correspondence;
obtaining service interface in call-chain, wherein the service interface takes
more time than the
average time in the time consumption comparison table, storing the service
interface as alternative
failure service interface; and
calculating failure credibility of alternative failure service interface
according to actual time and
correspondingly average time of the alternative failure service interface,
filtering out alternative
failure service interface with greater failure credibility than pre-set
credibility threshold and storing
as alternative failure service interface.
6. The method for locating root cause of business system failure according to
claim 1, wherein, the method
of calculation engine filters out failure network element comprises:
pre-calculating sample mean and stability of network element, and
correspondingly storing
indicator name identification, stability and sample mean of network element in
stability comparison
table;
obtaining and parsing network element indicator messages, obtaining indicator
name identification
and current value of the network element indicator;
searching for correspondingly sample mean and stability from the stability
comparison table
according to the indicator name identification; and
according to the current value of network indicator and the correspondingly
sample mean and
stability, calculating first variant range of network element indicator,
storing the network element
indicator with greater first variant range than pre-set first variant
threshold as failure network
element indicator.
7. The method for locating root cause of business system failure according to
claim 6, the method of pre-
calculating sample mean and stability of network element comprises:
collecting x network element indicator samples;
calculating mean value of the network element indicator samples and second
variant range of each
network element indicator sample, counting quantity n of failure network
element indicator samples
with greater second variant range than pre-set second variant threshold; and
calculating stability m of network element indicator, wherein m = (x - n) / x.
8. The method for locating root cause of business system failure according to
claim 6 or 7, the method of
calculation engine filters out failure network element indicator also
comprises:
obtaining pre-set quantity of network element indicator value before current
network element
indicator value as recent network element indicator value;
calculating recent mean value according to the recent network element
indicator value;
calculating recent variant range of network element indicator according to the
recent mean value
and the current value of network element indicator;
calculating sample variant range of network element indicator according to the
sample mean and
the current value of network element indicator; and
after weighted calculation based on the recent variant range and the sample
variant range of network
element indicator, combining with stability of network element indicator to
obtain the first variant
range of network element indicator.
9. The method for locating root cause of business system failure according to
claim 6, wherein, indicator
name identification and indicator value of network element indicator are
stored in a sliding matrix based
on time, each column of the sliding matrix corresponds to a minute span, each
row corresponds to an
indicator value identified by an indicator name identification; or each row
corresponds to a minute span,
each column corresponds to an indicator value identified by an indicator name
identification.
10. A system for locating root cause of business system failure, wherein,
comprising a calculation engine,
a distributed publish and subscribe message system and a network element, the
calculation engine
comprises a message reading module, a call-chain processing module, a network
element indicator
processing module, and a failure warning module, wherein,
the distributed publish and subscribe message system is configured to store
network element
indicators and generated call-chain messages when service interface deployed
by the network
element is called;
the message reading module is configured to read call-chain messages and
network element
indicator messages from the distributed publish and subscribe message system;
the call-chain processing module is configured to sample the call-chain
messages according to
sampling ratio, grouping and assembling to obtain the call-chain, obtaining
time consumption of
each service interface in the call-chain, and filtering out failure service
interface according to time
consumption comparison table;
the network element indicator processing module is configured to calculate first
variation range of the
network element indicator and filter out failure network element indicator
according to the first
variation range threshold value; and
the failure warning module is configured to correlate the failure service
interface with the failure
network element indicator according to time and issues failure warning, the
failure warning
comprises failure service interface and failure network element indicator,
wherein the failure
network element indicator causes the failure service interface to fail.

Description

Note: Descriptions are shown in the official language in which they were submitted.

METHOD AND SYSTEM FOR POSITIONING FAULT ROOT CAUSE OF SERVICE SYSTEM
Field
[0001] The present disclosure relates to the technical field of operation and
maintenance monitoring in computer cloud environments, and in particular to a method
and system for locating the root cause of a business system failure.
Background
[0002] At present, cloud technology continues to improve, and private and public
clouds continue to appear. Operation and maintenance monitoring technology is
developing alongside them, and software and hardware monitoring technology is already
relatively rich. It includes the monitoring of software and hardware indicators in the
cloud environment and the monitoring of the business running in the cloud environment.
An example of the former is monitoring, through Prometheus, network element indicators
such as the CPU usage, RAM usage, disk I/O usage, network I/O usage and number of
Redis connections of a network element. An example of the latter is analyzing and
marking the range of failed network elements through business call-chain data.
[0003] However, technology that combines the software and hardware indicator
monitoring of the cloud environment with the monitoring of services running in the
cloud environment is relatively scarce. When the business system encounters failures
such as increased request response time, a decreased request success rate, or a sudden
drop or surge in TPM, there is no method that correlates the two monitoring
technologies to quickly locate the root cause of the business system failure. For
example, suppose the request success rate of service interface A in the production
environment is low and a warning is raised. Troubleshooting the root cause with
traditional methods requires manually checking the network element indicator log files
and the call-chain log files to investigate upstream and downstream. Because a
plurality of indicators are involved, this usually requires the collaboration of
people from a plurality of fields, and it is time-consuming and labor-intensive.
Invention Content
[0004] The purpose of the present invention is to provide a method and system for
locating the root cause of a business system failure, which combine call-chain logs
and network element indicator data to quickly locate the root cause of the business
system failure.
[0005] To achieve the above purpose, the present invention provides following
technical solutions:
[0006] A method for locating root cause of business system failure,
comprising:
[0007] Calculation engine obtains call-chain messages, sampling the call-chain
messages according to
sampling ratio, grouping and assembling to obtain the call-chain, obtaining
time consumption of each
service interface in the call-chain, and filtering out failure service
interface according to time consumption
comparison table;
[0008] Calculation engine obtains network element indicator messages,
calculating first variation range
of the network element indicator, filtering out failure network element
indicator according to the first
variation range threshold value;
[0009] Calculation engine correlates the failure service interface with the
failure network element
indicator according to time and issues failure warning, the failure warning
comprises failure service
interface and failure network element indicator, wherein the failure network
element indicator causes the
failure service interface to fail.
[0010] Preferably, the method of calculation engine obtains call-chain
messages and network element
indicator messages comprises:
[0011] Network element periodically generates network element indicator
messages and stores in
indicator log file;
[0012] Generating call-chain messages when service interface deployed by
network element is called and
storing in call-chain unit log file;
[0013] Network element uses log file collection module to push newly added
network element indicator
messages of the indicator log file and newly added call-chain messages of the
call-chain unit log file to
distributed publish and subscribe message system;
[0014] Calculation engine reads call-chain messages and network element
indicator messages from the
distributed publish and subscribe message system.
[0015] Preferably, method of calculation engine filters out the failure
service interface comprises:
[0016] Obtaining and parsing the call-chain messages to get call-chain ID,
service interface name
identification and event corresponding to call-chain message, meanwhile,
obtaining call-chain entry
message;
[0017] Configuring sampling ratio of the call-chain messages, obtaining call-
chain entry message sample
according to the sampling ratio;
[0018] According to call-chain ID and event time to filter out call-chain
messages belonging to the same
group as the call-chain entry message sample, assembling all call-chain
message of the same group to get
call-chain information;
[0019] Obtaining time consumption of each service interface in call-chain and
filtering out failure service
interface according to time consumption comparison table.
[0020] Specifically, the method of parsing the call-chain messages to get call-
chain entry message
comprises:
[0021] Setting expiration time of the call-chain messages, filtering and
parsing unexpired call chain
messages according to the expiration time, obtaining call-chain ID, service
interface name identification
and event time of the call-chain messages;
[0022] Filtering out call-chain entry message in the call-chain messages
according to call-chain ID and
service interface name identification.
[0023] Furthermore, the method of filtering out failure service interface
according to time consumption
comparison table comprises:
[0024] Pre-calculating average time consumption corresponding to service
interface, storing service
interface name identification of the service interface and the average time in
the time consumption
comparison table with one-to-one correspondence;
[0025] Obtaining service interface in call-chain, wherein the service
interface takes more time than the
average time in the time consumption comparison table, storing the service
interface as alternative failure
service interface;
[0026] Calculating failure credibility of alternative failure service
interface according to actual time and
correspondingly average time of the alternative failure service interface,
filtering out alternative failure
service interface with greater failure credibility than pre-set credibility
threshold and storing as alternative
failure service interface.
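
As an illustration of paragraphs [0024] to [0026], the following Python sketch flags
interfaces whose time consumption exceeds the table average and then scores them. The
source does not give the credibility formula: the ratio (actual - average) / actual,
the 0.5 threshold, the interface names and the `Span` type are all assumptions made
for this example only.

```python
from dataclasses import dataclass

# Hypothetical time consumption comparison table:
# service interface name identification -> pre-calculated average time (ms).
AVERAGE_TIME_MS = {"order.create": 120.0, "stock.reserve": 40.0}


@dataclass
class Span:
    interface: str     # service interface name identification
    elapsed_ms: float  # actual time consumption in this call-chain


def failure_credibility(actual_ms: float, average_ms: float) -> float:
    """Assumed formula: share of the actual time that exceeds the average."""
    if actual_ms <= average_ms:
        return 0.0
    return (actual_ms - average_ms) / actual_ms


def failure_interfaces(spans, table, threshold=0.5):
    """Return (interface, credibility) pairs whose credibility exceeds the threshold."""
    failures = []
    for span in spans:
        average = table.get(span.interface)
        if average is None or span.elapsed_ms <= average:
            continue  # unknown interface, or not slower than average: not a candidate
        score = failure_credibility(span.elapsed_ms, average)
        if score > threshold:
            failures.append((span.interface, score))
    return failures
```

With these assumed values, a 480 ms call to `order.create` scores 0.75 and is kept,
while a 45 ms call to `stock.reserve` is a candidate but falls below the threshold.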
[0027] Preferably, the method of calculation engine filters out failure
network element comprises:
[0028] Pre-calculating sample mean and stability of network element, and
correspondingly storing
indicator name identification, stability and sample mean of network element in
stability comparison table;
[0029] Obtaining and parsing network element indicator messages, obtaining indicator
name identification and current value of the network element indicator;
[0030] Searching for correspondingly sample mean and stability from the
stability comparison table
according to the indicator name identification;
[0031] According to the current value of network indicator and the
correspondingly sample mean and
stability, calculating first variant range of network element indicator,
storing the network element indicator
with greater first variant range than pre-set first variant threshold as
failure network element indicator.
[0032] Specifically, the method of pre-calculating sample mean and stability
of network element
comprises:
[0033] Collecting x network element indicator samples;
[0034] Calculating mean value of the network element indicator samples and
second variant range of
each network element indicator sample, counting quantity n of failure network
element indicator samples
with greater second variant range than pre-set second variant threshold;
[0035] Calculating stability m of network element indicator, wherein m = (x - n) / x.
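
A minimal Python sketch of the pre-calculation in [0033] to [0035]. It assumes that the
"variant range" of a sample is its absolute deviation from the mean, relative to the
mean; the source does not define it. The function names and the threshold value are
hypothetical.

```python
def relative_deviation(value: float, mean: float) -> float:
    """Assumed 'variant range': absolute deviation from the mean, relative to the mean."""
    if mean == 0:
        return 0.0 if value == 0 else float("inf")
    return abs(value - mean) / abs(mean)


def sample_mean_and_stability(samples, second_threshold):
    """Return (mean, m), where m = (x - n) / x over the x collected samples."""
    x = len(samples)
    if x == 0:
        raise ValueError("at least one sample is required")
    mean = sum(samples) / x
    # n counts the samples whose second variant range exceeds the pre-set threshold.
    n = sum(1 for v in samples if relative_deviation(v, mean) > second_threshold)
    return mean, (x - n) / x
```

For the samples 50, 50, 50, 50, 100 and an assumed threshold of 0.5, the mean is 60,
one sample deviates too far (n = 1), and the stability is m = (5 - 1) / 5 = 0.8.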
[0036] Preferably, the method of calculation engine filters out failure
network element indicator also
comprises:
[0037] Obtaining pre-set quantity of network element indicator value before
current network element
indicator value as recent network element indicator value;
[0038] Calculating recent mean value according to the recent network element
indicator value;
[0039] Calculating recent variant range of network element indicator according
to the recent mean value
and the current value of network element indicator;
[0040] Calculating sample variant range of network element indicator according
to the sample mean and
the current value of network element indicator;
[0041] After weighted calculation based on the recent variant range and the
sample variant range of
network element indicator, combining with stability of network element
indicator to obtain the first variant
range of network element indicator.
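
The steps in [0037] to [0041] can be sketched as follows. The source names the inputs
(recent mean, sample mean, stability) but not the formulas, so everything numeric here
is an assumption: the variant range is taken as a relative deviation, the two ranges
are blended with a hypothetical `recent_weight`, and "combining with stability" is
taken to mean multiplying by m.

```python
def relative_deviation(value: float, mean: float) -> float:
    """Assumed 'variant range': absolute deviation from the mean, relative to the mean."""
    if mean == 0:
        return 0.0 if value == 0 else float("inf")
    return abs(value - mean) / abs(mean)


def first_variation_range(current, recent_values, sample_mean, stability,
                          recent_weight=0.5):
    """Blend the recent and sample variant ranges, then scale by stability m."""
    if not recent_values:
        raise ValueError("at least one recent value is required")
    recent_mean = sum(recent_values) / len(recent_values)
    recent_range = relative_deviation(current, recent_mean)
    sample_range = relative_deviation(current, sample_mean)
    weighted = recent_weight * recent_range + (1 - recent_weight) * sample_range
    # Assumption: a deviation on a normally stable indicator (m near 1) counts in
    # full, while the same deviation on an unstable indicator is discounted.
    return weighted * stability
```

An indicator would then be stored as a failure network element indicator when the
returned value exceeds the pre-set first variant threshold.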
[0042] Preferably, indicator name identification and indicator value of
network element indicator are
stored in a sliding matrix based on time, wherein, each column of the sliding
matrix corresponds to a minute
span, each row corresponds to an indicator value identified by an indicator
name identification; or each row
corresponds to a minute span, each column corresponds to an indicator value
identified by an indicator
name identification.
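
One possible shape for the time-based sliding matrix of [0042] is sketched below in
Python, with one row per indicator name and one column per minute span. The class
name, the window width and the choice to keep the latest value seen within each minute
are assumptions; the source only fixes the row and column meaning.

```python
class SlidingMatrix:
    """Rows are indicator names; columns are consecutive minute spans, oldest first."""

    def __init__(self, width_minutes: int):
        self.width = width_minutes
        self.columns = {}    # epoch minute -> {indicator name: latest value}
        self.newest = None   # most recent epoch minute seen so far

    def put(self, name: str, epoch_seconds: float, value: float) -> None:
        minute = int(epoch_seconds) // 60
        if self.newest is None or minute > self.newest:
            self.newest = minute          # the window slides forward
        oldest = self.newest - self.width + 1
        if minute < oldest:
            return                        # value is older than the window: drop it
        self.columns.setdefault(minute, {})[name] = value
        for stale in [m for m in self.columns if m < oldest]:
            del self.columns[stale]       # evict columns that slid out of the window

    def row(self, name: str):
        """Values of one indicator across the window; None where a minute has no value."""
        if self.newest is None:
            return []
        oldest = self.newest - self.width + 1
        return [self.columns.get(m, {}).get(name)
                for m in range(oldest, self.newest + 1)]
```

The recent indicator values needed for a recent mean can be read directly from
`row(name)`; swapping rows and columns, as the paragraph allows, changes only the
orientation of the lookup.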
[0043] A system for locating root cause of business system failure, wherein,
comprising a calculation
engine, a distributed publish and subscribe message system and a network
element, the calculation engine
comprises a message reading module, a call-chain processing module, a network
element indicator
processing module, and a failure warning module, wherein,
[0044] The distributed publish and subscribe message system is configured to
store network element
indicators and generated call-chain messages when service interface deployed
by the network element is
called;
[0045] The message reading module is configured to read call-chain messages
and network element
indicator messages from the distributed publish and subscribe message system;
[0046] The call-chain processing module is configured to sampling the call-
chain messages according to
sampling ratio, grouping and assembling to obtain the call-chain, obtaining
time consumption of each
service interface in the call-chain, and filtering out failure service
interface according to time consumption
comparison table;
[0047] The network element indicator processing module is configured calculate
first variation range of
the network element indicator and filter out failure network element indicator
according to the first variation
range threshold value;
[0048] The failure warning module is configured to correlate the failure
service interface with the failure
network element indicator according to time and issues failure warning, the
failure warning comprises
failure service interface and failure network element indicator, wherein the
failure network element
indicator causes the failure service interface to fail.
[0049] Comparing with the prior art, the method and system for locating root
cause of business system
failure provided by the present invention has the following beneficial
effects:
[0050] In the method for locating the root cause of a business system failure provided
by the present invention, after the calculation engine obtains the call-chain messages
and the network element indicator messages, it performs failure analysis on each,
filtering out the failure service interfaces and the failure network element
indicators. It then correlates the failure service interfaces with the failure network
element indicators according to time, so as to identify and warn of the root cause of
the service failure, namely the failure service interface and the failure network
element indicator that caused the interface to fail. This reduces the complexity of
troubleshooting for operation and maintenance personnel and saves the time and labor
cost of troubleshooting.
[0051] The system for locating root cause of business system failure provided
by the present invention
adopts the above-mentioned method for locating root cause of business system
failure, by combining call-
chain log and network element indicator data to quickly locate the root cause
of business system failure.
Drawing Description
[0052] The drawings described here are used to provide further understandings
of the present
invention and constitute a part of the present invention. The illustrated
exemplary implementations
and descriptions are used to explain the present invention, and do not
constitute an improper limitation
of the present invention. For the attached figures:
[0053] Figure 1 is a process diagram of a method for locating root cause of
business system failure in the
implementation of the present invention;
[0054] Figure 2 is a task logic diagram of the calculation engine in the
implementation of the present invention;
[0055] Figure 3 is a system architecture diagram of locating root cause of
business system failure in the
implementation of the present invention.
Specific implementation methods
[0056] In order to make the purpose, technical solutions and benefits of the present
invention clearer, the technical solutions of the implementations in the present
application are described clearly and completely below with reference to the
accompanying drawings. Obviously, the described implementations are only a part of the
implementations in the present application. Based on the implementations in the
present application, all other implementations obtained by those of ordinary skill in
the art fall within the protection scope of the present application.
[0057] Implementation one
[0058] Please refer to Figure 1, a method for locating root cause of business
system failure provided by
the implementation of the present invention, comprising:
[0059] Calculation engine obtains call-chain messages, sampling the call-chain
messages according to
sampling ratio, grouping and assembling to obtain the call-chain, obtaining
time consumption of each
service interface in the call-chain, and filtering out failure service
interface according to time consumption
comparison table;
[0060] Calculation engine obtains network element indicator messages,
calculating first variation range
of the network element indicator, filtering out failure network element
indicator according to the first
variation range threshold value;
[0061] Calculation engine correlates the failure service interface with the
failure network element
indicator according to time and issues failure warning, the failure warning
comprises failure service
interface and failure network element indicator, wherein the failure network
element indicator causes the
failure service interface to fail.
[0062] In the method for locating the root cause of a business system failure provided
by the implementation of the present invention, after the calculation engine obtains
the call-chain messages and the network element indicator messages, it performs
failure analysis on each, filtering out the failure service interfaces and the failure
network element indicators. It then correlates the failure service interfaces with the
failure network element indicators according to time, so as to identify and warn of
the root cause of the service failure, namely the failure service interface and the
failure network element indicator that caused the interface to fail. This reduces the
complexity of troubleshooting for operation and maintenance personnel and saves the
time and labor cost of troubleshooting.
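
The time-based correlation step can be sketched in Python as below. The source states
only that the two kinds of failure are correlated "according to time". The 60-second
window, the additional match on network element, and the record and field names are
assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceFailure:
    interface: str   # failure service interface
    element: str     # network element the interface is deployed on (assumed field)
    time: float      # epoch seconds


@dataclass(frozen=True)
class IndicatorFailure:
    element: str     # network element unique identification
    indicator: str   # indicator name identification
    time: float      # epoch seconds


def correlate(interface_failures, indicator_failures, window_seconds=60.0):
    """Build one warning per failure interface that has nearby indicator failures."""
    warnings = []
    for svc in interface_failures:
        causes = [
            ind for ind in indicator_failures
            if ind.element == svc.element
            and abs(ind.time - svc.time) <= window_seconds
        ]
        if causes:
            warnings.append({
                "failure_interface": svc.interface,
                "failure_indicators": [(c.element, c.indicator) for c in causes],
            })
    return warnings
```

Each warning carries both the failure service interface and the indicator failures
that fall inside the window, which is the pairing the failure warning is said to
contain.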
[0063] Please refer to Figure 2 or Figure 3, the method for locating root
cause of business system failure
provided by the implementation of the present invention, the method of
calculation engine obtains call-
chain messages and network element indicator messages comprises:
[0064] Network element periodically generates network element indicator
messages and stores in
indicator log file;
[0065] Generating call-chain messages when service interface deployed by
network element is called and
storing in call-chain unit log file;
[0066] Network element uses log file collection module to push newly added
network element indicator
messages of the indicator log file and newly added call-chain messages of the
call-chain unit log file to
distributed publish and subscribe message system;
[0067] Calculation engine reads call-chain messages and network element
indicator messages from the
distributed publish and subscribe message system.
[0068] Those skilled in the art should know that a network element refers to a basic
service unit in the cloud environment that has a globally unique identification name,
IP and other information, for example a docker container with identification name
docker 001. A network element indicator refers to a specific indicator monitored in
the network element, which has an indicator name and an indicator value at a given
time point, such as RAM usage, CPU usage or Redis connections. A call-chain refers to
the systems, network elements and key instrumented methods through which a business
request passes. The main dimensions involved include: network element unique
identification, service interface name, call-chain ID, service interface ID (unique
in the call-chain), caller service ID, start time, service interface execution time,
success and failure identification, etc.
[0069] In a specific implementation, the network element periodically (every 1 second,
30 seconds, 60 seconds, 300 seconds, etc.; the period can be customized) generates
network element indicator monitoring information and stores it in the indicator log
file. The log collection module (such as Flume) integrated in the network element
monitors changes in the indicator log file and pushes the newly added network element
indicator messages to Kafka or another distributed publish and subscribe message
system. The indicator log file is generated on a rolling basis, and historical
indicator log files are periodically cleared. The network element has service
interfaces deployed on it; a service interface generates call-chain messages when it
is called by the business and stores them in the call-chain unit log. The call-chain
unit log is also pushed to the distributed publish and subscribe message system by the
log collection module, and the calculation engine then reads the call-chain messages
and network element indicator messages from the distributed publish and subscribe
message system.
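
The collection path described above can be sketched in Python with an in-memory
stand-in for the publish and subscribe system. This is not Flume or Kafka code:
`InMemoryBroker`, `LogCollector`, the topic name and the log line format are
hypothetical, and log rolling is not handled. The sketch shows only the idea of
pushing the newly added lines of a log file and reading them back in order.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Stand-in for a distributed publish/subscribe system; each topic is a FIFO queue."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic):
        """Yield and remove every message currently queued on the topic."""
        queue = self.topics[topic]
        while queue:
            yield queue.popleft()


class LogCollector:
    """Pushes only the complete lines appended to a log file since the previous call."""

    def __init__(self, path, topic, broker):
        self.path, self.topic, self.broker = path, topic, broker
        self.offset = 0  # byte offset of the first line not yet pushed

    def push_new_lines(self):
        with open(self.path, "rb") as handle:
            handle.seek(self.offset)
            data = handle.read()
        end = data.rfind(b"\n") + 1   # leave a trailing partial line for next time
        self.offset += end
        lines = data[:end].decode("utf-8").splitlines()
        for line in lines:
            self.broker.publish(self.topic, line)
        return len(lines)
```

The calculation engine side then corresponds to iterating over `broker.poll(topic)`
for the indicator topic and for the call-chain topic.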
[0070] Please refer to Figure 2, the method for locating root cause of
business system failure provided by
the implementation of the present invention, the method of calculation engine
filters out the failure service
interface comprises:
[0071] Obtaining and parsing the call-chain messages to get call-chain ID,
service interface name
identification and event corresponding to call-chain message, meanwhile,
obtaining call-chain entry
message;
[0072] Configuring sampling ratio of the call-chain messages, obtaining call-
chain entry message sample
according to the sampling ratio;
[0073] According to call-chain ID and event time to filter out call-chain
messages belonging to the same
group as the call-chain entry message sample, assembling all call-chain
message of the same group to get
call-chain information;
[0074] Obtaining time consumption of each service interface in call-chain and
filtering out failure service
interface according to time consumption comparison table.
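
The grouping and assembling steps in [0073] and [0074] can be sketched in Python as
follows. The message fields (`chain_id`, `interface`, `event_time`, `elapsed_ms`) and
the rule that a message belongs to a group when it shares the sampled entry's
call-chain ID and falls within `window_seconds` after the entry's event time are
assumptions; the source says only that grouping uses the call-chain ID and event time.

```python
from collections import defaultdict


def assemble_call_chains(messages, sampled_entries, window_seconds=60.0):
    """Group messages under sampled entries.

    sampled_entries maps call-chain ID -> event time of the sampled entry message.
    """
    chains = defaultdict(list)
    for msg in messages:
        entry_time = sampled_entries.get(msg["chain_id"])
        if entry_time is None:
            continue  # this call-chain was not sampled
        if 0 <= msg["event_time"] - entry_time <= window_seconds:
            chains[msg["chain_id"]].append(msg)
    for spans in chains.values():
        spans.sort(key=lambda m: m["event_time"])  # assemble in call order
    return dict(chains)


def time_consumption(chain):
    """Map each service interface in one assembled chain to its elapsed time (ms)."""
    return {m["interface"]: m["elapsed_ms"] for m in chain}
```

The per-interface times returned by `time_consumption` are what would then be compared
against the time consumption comparison table.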
[0075] Wherein, the method of parsing the call-chain messages to get call-
chain entry message comprises:
[0076] Setting expiration time of the call-chain messages, filtering and
parsing unexpired call chain
messages according to the expiration time, obtaining call-chain ID, service
interface name identification
and event time of the call-chain messages;
[0077] Filtering out call-chain entry message in the call-chain messages
according to call-chain ID and
service interface name identification.
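
A minimal Python sketch of [0076] and [0077]. The `chain_id|interface|event_time` wire
format and the idea of recognising entry messages through a configured set of entry
interface names are assumptions; the source does not say how an entry message is
distinguished beyond using the call-chain ID and the service interface name
identification.

```python
def parse_unexpired(raw_messages, now, expiration_seconds):
    """Parse messages and drop those whose event time is older than the expiration."""
    parsed = []
    for raw in raw_messages:
        chain_id, interface, event_time = raw.split("|")
        event_time = float(event_time)
        if now - event_time > expiration_seconds:
            continue  # expired: discard without further processing
        parsed.append({"chain_id": chain_id, "interface": interface,
                       "event_time": event_time})
    return parsed


def entry_messages(parsed, entry_interfaces):
    """Return one (interface, call-chain ID, event time) tuple per call-chain entry."""
    seen = set()
    entries = []
    for msg in parsed:
        if msg["interface"] in entry_interfaces and msg["chain_id"] not in seen:
            seen.add(msg["chain_id"])
            entries.append((msg["interface"], msg["chain_id"], msg["event_time"]))
    return entries
```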
[0078] In a specific implementation, after the calculation engine receives a
call-chain message, it first judges whether the message has expired according to the
event time in the call-chain message. If the message exceeds the set expiration time,
it is discarded directly. If the message does not exceed the set expiration time, the
call-chain ID and service interface name identification are parsed, and the call-chain
entry message is then filtered out of the call-chain messages according to the
call-chain ID and service interface name identification. The data format of the
call-chain entry message can be (service interface name identification, call-chain ID,
event time). The TPS (Transactions Per Second, that is, the number of transactions
processed by the server per second) identified by the service interface name of the
call-chain entry message is calculated in real time, the sampling ratio of the
call-chain messages is then configured, and a sample of the call-chain entry messages
is obtained according to the sampling ratio.
[0079] In the sampling process, the sampling gradient corresponding to the
configured TPS is loaded first. Each level of the sampling gradient maps a
different TPS range to a sampling ratio, and each sampling ratio is mapped to a
rotation range. For example, if the sampling ratio is 10%, the rotation range
is 10: when the first data item arrives it is sampled, the next 9 are not
sampled, and the 11th data item starts the subsequent cycle, where it is
treated like the first item and is sampled. After each sampling, the sampling
gradient is obtained in real time according to the TPS value, and the
corresponding sampling ratio and cycle are obtained according to the sampling
gradient. Call-chain entry messages are all retained whether or not they are
included in the sample, but each is marked to indicate whether it is a sample.
The data format of the marked call-chain entry message is (service interface
name identification, call-chain ID, event time, whether it is a sample). The
marked messages are temporarily stored in the sampling comparison table, and
the call-chain entry messages in the sampling comparison table can be set to
expire after 60 seconds.
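The rotation described above can be sketched in Python as follows. The gradient levels and TPS bounds are hypothetical values chosen for illustration, and the sketch re-reads the gradient on every message rather than only after each sampling, which is a simplification.

```python
from typing import List, Tuple

# Hypothetical gradient: (upper TPS bound, sampling ratio). Illustrative values only.
GRADIENT: List[Tuple[float, float]] = [
    (100.0, 1.0),          # up to 100 TPS: keep every entry message
    (1000.0, 0.1),         # up to 1000 TPS: keep 1 in 10
    (float("inf"), 0.01),  # above 1000 TPS: keep 1 in 100
]


def rotation_range(tps: float) -> int:
    """Map the current TPS to a rotation range (1 / sampling ratio)."""
    for upper, ratio in GRADIENT:
        if tps <= upper:
            return round(1 / ratio)
    raise ValueError("gradient must end with an unbounded level")


class RotationSampler:
    """Marks the first entry message of every rotation cycle as a sample."""

    def __init__(self) -> None:
        self.position = 0  # index of the next message within the current cycle

    def is_sample(self, tps: float) -> bool:
        sampled = self.position == 0
        # The rotation range is looked up again on each call, so a change in
        # TPS changes the length of the cycle.
        self.position = (self.position + 1) % rotation_range(tps)
        return sampled
```

With a steady TPS of 500 the rotation range is 10, so the 1st, 11th and 21st entry messages are marked as samples and the others are marked as non-sample.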
[0080] All unexpired call-chain messages are then sampled based on the sampling
comparison table. That is, after a call-chain message is received, the sampling
comparison table is checked according to the call-chain ID and event time. If
the call-chain entry message corresponding to the call-chain message is marked
as a sample in the sampling comparison table, the call-chain message and the
corresponding call-chain entry message are stored together as the same group of
data. If the corresponding call-chain entry message is marked as non-sample,
the call-chain message is discarded. If the sampling comparison table has no
call-chain entry message corresponding to the call-chain message, the
call-chain message is cached for 60 seconds, and the sampling comparison table
is checked again as it is updated within those 60 seconds; if the corresponding
call-chain entry message is still not found after 60 seconds, the call-chain
message is discarded. In this sampling method, multi-level sampling gradients
are set and the sampling ratio can be adjusted according to the amount of data,
which helps the system cope flexibly with different data volumes and avoids
instability caused by a surge in the number of messages. Sampling is also
balanced according to the TPS of each business line (call link), to ensure that
data can be sampled for every business line.
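The three outcomes of the lookup in the sampling comparison table can be sketched as follows. This is a simplified illustration under stated assumptions: the table is modelled as a dictionary keyed by call-chain ID only (the description also uses the event time), and the names `Decision`, `decide` and `recheck_cached` are hypothetical.

```python
from enum import Enum
from typing import Dict


class Decision(Enum):
    KEEP = "keep"        # entry message is marked as a sample
    DISCARD = "discard"  # entry message is marked as non-sample, or the wait timed out
    CACHE = "cache"      # no entry message yet: hold the message and retry


CACHE_SECONDS = 60.0  # caching period given in the description


def decide(chain_id: str, table: Dict[str, bool]) -> Decision:
    """Look up the sample mark of the call-chain entry message."""
    marked = table.get(chain_id)
    if marked is None:
        return Decision.CACHE
    return Decision.KEEP if marked else Decision.DISCARD


def recheck_cached(chain_id: str, cached_at: float, now: float,
                   table: Dict[str, bool]) -> Decision:
    """Re-evaluate a cached message; give up once the caching period has passed."""
    decision = decide(chain_id, table)
    if decision is Decision.CACHE and now - cached_at >= CACHE_SECONDS:
        return Decision.DISCARD
    return decision
```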
[0081] In actual sampling, the sampled call-chain messages can be grouped and
cached by "minutes" (the absolute number of minutes converted from the event
time, counted from 0:00 on January 1, 1970) and "call-chain ID". When each
call-chain message arrives, it is judged whether all the messages of the
current call-chain have arrived. If all messages have arrived, the call-chain
is assembled and the actual time consumption of each node in the call-chain,
that is, the time consumption of each service interface in the call-chain, is
obtained at the same time. Finally, the failure service interface is filtered
out according to the time consumption comparison table. Wherein, the method for
filtering out the failure service interface according to the time consumption
comparison table includes:
[0082] Pre-calculating the average time consumption corresponding to each
service interface, and storing the service interface name identification of the
service interface and the average time consumption in the time consumption
comparison table in one-to-one correspondence;
[0083] Obtaining each service interface in the call-chain that takes more time
than the average time in the time consumption comparison table, and storing the
service interface as an alternative failure service interface;
[0084] Calculating the failure credibility of the alternative failure service
interface according to the actual time and the corresponding average time of
the alternative failure service interface, filtering out the alternative
failure service interface whose failure credibility is greater than the pre-set
credibility threshold, and storing it as a failure service interface.
[0085] In specific implementation, the time consumption comparison table can be
generated each time the system for locating the root cause of the failure is
started, or a locally existing time consumption comparison table can be
imported directly. For example, after the system for locating the root cause of
the failure is started and call-chain information is received, the average time
consumption corresponding to each service interface is calculated and the time
consumption comparison table is updated in real time. When the time consumption
data collected for a given service interface in the comparison table reaches a
credible sample size, that service interface is marked in the time consumption
comparison table as usable for failure judgement. The updated and completed
time consumption comparison table can be stored locally and imported directly
when the system for locating the root cause of failure is restarted.
[0086] For the assembled call-chain, the node (service interface) that has
failed in the call-chain is analyzed according to the time consumption
comparison table. In other words, a service interface in the call-chain that
actually takes more time than the average time in the time consumption
comparison table is stored as an alternative failure service interface. At the
same time, the failure credibility of the alternative failure service interface
is calculated according to its actual time and the corresponding average time,
and an alternative failure service interface whose failure credibility is
greater than the pre-set credibility threshold is filtered out and stored as a
failure service interface. The failure service interface can be stored together
with the entire call-chain in which it is located, so that the network element
indicators can be troubleshot upstream and downstream. Here, failure
credibility = actual time consumption / recent average time consumption. The
recent average time consumption can be calculated from a specific sample size,
for example 360 time consumption samples. The larger the failure credibility
value, the higher the credibility. The credibility threshold can be set to 1.8
and can be adjusted according to the actual business.
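The two-stage filter of [0083], [0084] and [0086] can be sketched as follows. The function names and the dictionary representation of a call-chain (service interface name mapped to actual time consumption) are assumptions made for the example; the ratio and the 1.8 threshold come from the description.

```python
from typing import Dict, List

CREDIBILITY_THRESHOLD = 1.8  # example threshold given in the description


def failure_credibility(actual: float, average: float) -> float:
    """failure credibility = actual time consumption / recent average time consumption."""
    return actual / average


def failure_interfaces(chain: Dict[str, float], averages: Dict[str, float],
                       threshold: float = CREDIBILITY_THRESHOLD) -> List[str]:
    """Return the service interfaces of one call-chain judged to have failed."""
    failed = []
    for service, actual in chain.items():
        average = averages.get(service)
        if average is None or actual <= average:
            continue  # no baseline yet, or not slower than average: not an alternative
        # The interface is an alternative failure service interface; keep it
        # only if its credibility exceeds the threshold.
        if failure_credibility(actual, average) > threshold:
            failed.append(service)
    return failed
```

For example, an interface that averages 200 ms and takes 400 ms has a credibility of 2.0 and is reported, while one that averages 100 ms and takes 120 ms (credibility 1.2) remains only an alternative.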
[0087] Referring to Figure 2, in an implementation of the present invention, a
method for locating the root cause of business system failure is provided.
Since network element indicators are collected at a fixed frequency and there
is no sudden surge, the full amount of data is analyzed directly without
sampling. The method of filtering out the failure network element indicator
comprises:
[0088] Pre-calculating the sample mean and stability of the network element
indicator, and correspondingly storing the indicator name identification,
stability and sample mean of the network element indicator in the stability
comparison table;
[0089] Obtaining and parsing network element indicator messages, and obtaining
the indicator name identification and current value of the network element
indicator;
[0090] Searching for the corresponding sample mean and stability in the
stability comparison table according to the indicator name identification;
[0091] Calculating the first variant range of the network element indicator
according to the current value of the network element indicator and the
corresponding sample mean and stability, and storing a network element
indicator whose first variant range is greater than the pre-set first variant
threshold as a failure network element indicator.
[0092] In specific implementation, the indicator name identification and
indicator value of each network element indicator are stored in a sliding
matrix. Each column of the sliding matrix corresponds to a minute span and each
row corresponds to the indicator values identified by one indicator name
identification; alternatively, each row corresponds to a minute span and each
column corresponds to the indicator values identified by one indicator name
identification. When new network element indicator information arrives, it is
added to the sliding matrix. The default width of the sliding matrix is 128,
each row stores the combination of a network element indicator name
identification and its indicator values, and each column corresponds to a
1-minute span by default; in other words, a row of data actually stores the
indicator values of a certain network element indicator name over 128 minutes.
The minute span can be adjusted according to different sampling frequencies to
generate different sliding matrices, and network element indicators of the same
sampling frequency are put in the same sliding matrix. The sliding matrix
periodically clears expired data and initializes new data; if the width of the
sliding matrix is 128 and each column corresponds to a 1-minute span by
default, the data is valid for 128 minutes.
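One possible shape for such a sliding matrix is sketched below. It is an illustration under assumptions, not the data structure of the present invention: the class and method names are hypothetical, columns are reused as a ring indexed by the absolute span number, and expired columns are cleared lazily when they are overwritten rather than by a periodic task.

```python
from typing import Dict, List, Optional


class SlidingMatrix:
    """One row per indicator name identification, one column per time span."""

    def __init__(self, width: int = 128, span_seconds: int = 60) -> None:
        self.width = width
        self.span_seconds = span_seconds
        self.rows: Dict[str, List[Optional[float]]] = {}
        # Absolute span number (spans since the epoch) currently held by each column.
        self.column_slots: List[Optional[int]] = [None] * width

    def add(self, name: str, event_time: float, value: float) -> None:
        slot = int(event_time // self.span_seconds)
        col = slot % self.width
        held = self.column_slots[col]
        if held is not None and slot < held:
            return  # too old: its column has already been reused
        if held != slot:
            # The column holds expired data (or nothing): clear it for the new span.
            for row in self.rows.values():
                row[col] = None
            self.column_slots[col] = slot
        self.rows.setdefault(name, [None] * self.width)[col] = value

    def values(self, name: str, now: float) -> List[float]:
        """Unexpired values of one indicator, oldest first."""
        current = int(now // self.span_seconds)
        oldest = current - self.width + 1
        found = []
        for col, value in enumerate(self.rows.get(name, [])):
            slot = self.column_slots[col]
            if value is not None and slot is not None and oldest <= slot <= current:
                found.append((slot, value))
        return [value for _, value in sorted(found)]
```

With the default width of 128 and a 60-second span, `values` returns at most the last 128 minutes of data for an indicator.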
[0093] Specifically, in the method for the calculation engine to filter out
the failure network
element indicators, the method of pre-calculating sample mean and stability of
network element
comprises:
[0094] Collecting x network element indicator samples;
[0095] Calculating mean value of the network element indicator samples and
second variant range of each
network element indicator sample, counting quantity n of failure network
element indicator samples with
greater second variant range than pre-set second variant threshold;
[0096] Calculating the stability m of the network element indicator, wherein
m = (x - n)/x.
[0097] For example, the network element indicator s has 360 continuous sample
indicator values s_i, where 1 ≤ i ≤ 360, and the average value of the sample
indicator values is v. The second variant range is set to a_i = |s_i - v|/v and
the second variant range threshold is set to 0.25. When a_i > 0.25, the network
element indicator s is considered to have an actual change. If the number of
actual changes is n, the stability m of the network element indicator s is
m = (360 - n)/360.
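The stability calculation of the example above can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes a non-empty sample list with a positive mean.

```python
from typing import Sequence


def stability(samples: Sequence[float], threshold: float = 0.25) -> float:
    """Share of samples whose relative deviation from the mean stays within the threshold."""
    x = len(samples)
    v = sum(samples) / x  # sample mean
    # n counts the samples whose second variant range |s_i - v| / v exceeds the threshold.
    n = sum(1 for s in samples if abs(s - v) / v > threshold)
    return (x - n) / x
```

An indicator whose samples never deviate by more than 25% from their mean has a stability of 1; each deviating sample lowers the stability by 1/x.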
[0098] In addition, the method by which the calculation engine filters out the
failure network element indicator also comprises:
[0099] Obtaining a pre-set quantity of network element indicator values
preceding the current network element indicator value as the recent network
element indicator values;
[0100] Calculating the recent mean value according to the recent network
element indicator values;
[0101] Calculating the recent variant range of the network element indicator
according to the recent mean value and the current value of the network element
indicator;
[0102] Calculating the sample variant range of the network element indicator
according to the sample mean and the current value of the network element
indicator; and
[0103] Performing a weighted calculation based on the recent variant range and
the sample variant range of the network element indicator, and combining the
result with the stability of the network element indicator to obtain the first
variant range of the network element indicator.
[0104] For example, let the first variant range of the network element
indicator be q, the current value be s1, the sample mean be v, the recent mean
be k and the stability be m; then
[0105] q = (|s1 - k|/k x 0.4 + |s1 - v|/v x 0.6) x m
[0106] In specific implementation, 5 to 10 recent samples can be taken to
calculate the recent mean k. The values 0.4 and 0.6 are respectively the
weights of the recent variant range and the sample variant range in the
calculation of the first variant range, and can be adjusted according to the
specific situation. Finally, whether the network element indicator is a failure
network element indicator is determined according to whether the first variant
range exceeds the first variant range threshold. The first variant range
threshold can be set to 0.45, and this value can also be adjusted according to
specific conditions.
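The weighted calculation of [0099] to [0106] can be sketched as follows. The sketch assumes that the recent and sample variant ranges are relative deviations of the form |current - mean| / mean, by analogy with the second variant range of [0097]; the function names are hypothetical, while the weights 0.4 and 0.6 and the threshold 0.45 are the example values from the description.

```python
from typing import Sequence


def variant_range(current: float, mean: float) -> float:
    """Relative deviation of the current value from a mean (assumed form)."""
    return abs(current - mean) / mean


def first_variant_range(current: float, sample_mean: float,
                        recent_values: Sequence[float], stability: float,
                        recent_weight: float = 0.4,
                        sample_weight: float = 0.6) -> float:
    """Weighted recent and sample variant ranges, scaled by the stability."""
    recent_mean = sum(recent_values) / len(recent_values)  # k, from 5 to 10 recent values
    recent = variant_range(current, recent_mean)
    sample = variant_range(current, sample_mean)
    return (recent * recent_weight + sample * sample_weight) * stability


def is_failure_indicator(q: float, threshold: float = 0.45) -> bool:
    """An indicator is a failure indicator when q exceeds the first variant threshold."""
    return q > threshold
```

Because the result is multiplied by the stability, an indicator that fluctuates often (low m) needs a larger deviation to be reported than a normally stable one.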
[0107] Finally, the calculation engine correlates the failure service interface
with the failure network element indicator according to time and issues a
failure warning. The failure warning comprises the failure service interface
and the failure network element indicator, wherein the failure network element
indicator causes the failure service interface to fail. The failure service
interface and the failure network element indicator can be stored in cache
media such as Redis; for example, Redis stores the failure service interfaces
and failure network element indicators that occurred within the last 15
minutes, for the failure correlation analysis to call.
[0108] In specific implementation, the service interface failure of the
call-chain is associated with the network element indicator failure according
to the failure "minutes" and the correlation of network elements, and the
network element indicator failures are ranked by the probability that they
caused the service interface failure. When the number of occurrences of a
service interface failure within a minute exceeds the threshold, a failure
warning is issued. If the number of service interface failures increases in the
subsequent time, it is determined that the service interface failure continues
and a failure warning is issued again; meanwhile, the determined service
interface failure and the network element indicator failure that caused it are
correspondingly stored for later analysis. If the number of service interface
failures does not change in the subsequent time, or the failure disappears in
the next minute, the failure warning is cancelled. Examples of failure
warnings:
[0109] 1. Initial announcement content of failure release.
[0110] Failure[warning], No.AAAA-BBBB-CCCC-NNNNNNN, the service interface i on
the network element A fails, and the service time exceeds the standard value by
80%, the standard value is Xa, the current average time consumption is xa, the
possible causes are ranked: 1. The indicator i on the network element B is
anomalous; 2. The indicator j on the network element A is anomalous.
[0111] 2. The failure continues after 1 minute.
[0112] Failure[warning], No.AAAA-BBBB-CCCC-NNNNNNN, the service interface i on
the network element A fails, and the service time exceeds the standard value by
80%, the standard value is Xa, the current average time consumption is xa, the
possible causes are ranked: 1. The indicator i on the network element B is
anomalous; 2. The indicator j on the network element A is anomalous.
[0113] 3. The failure disappears after 1 minute.
[0114] Failure[dismiss], No.AAAA-BBBB-CCCC-NNNNNNN, the service interface i on
the network element A fails, and the service time exceeds the standard value by
80%, the standard value is Xa, the current average time consumption is xa, the
possible causes are ranked: 1. The indicator i on the network element B is
anomalous; 2. The indicator j on the network element A is anomalous.
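The warning, continuation and dismissal sequence illustrated by these three messages can be sketched as a small state machine. This is one reading of [0108], stated here as an assumption: the input is the number of new failures of one service interface in each successive minute, a warning continues while new failures keep arriving (the total keeps increasing), and it is dismissed in the first minute with no new failures. The function name and the event labels are hypothetical.

```python
from typing import List, Optional


def warning_events(new_failures_per_minute: List[int],
                   threshold: int) -> List[Optional[str]]:
    """Return, for each minute, "warning", "dismiss" or None (nothing to send)."""
    events: List[Optional[str]] = []
    active = False  # True while a warning is outstanding
    for count in new_failures_per_minute:
        if not active:
            if count > threshold:
                active = True
                events.append("warning")  # initial announcement
            else:
                events.append(None)
        elif count > 0:
            events.append("warning")      # the failure continues
        else:
            active = False
            events.append("dismiss")      # no new failures: cancel the warning
    return events
```

For instance, with a threshold of 3 the per-minute counts 5, 2, 0 would produce a warning, a continued warning and then a dismissal, matching the order of the three example messages.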
[0115] In addition, the service interface deployed by the network element also
generates an exception stack message when the interface is called and stores it
in the exception stack log. The exception stack log is also pushed to Kafka or
another distributed publish and subscribe message system through the log
collection module. The calculation engine obtains the exception stack message
from the distributed publish and subscribe message system, parses it to obtain
the call-chain ID, service ID and exception stack data, and stores them in a
distributed document database, for example ES. When a user checks the failure
according to the failure warning, the corresponding ES index is looked up
according to the date information, and the exception stack can be queried and
displayed through the call-chain ID, service ID, etc. The specific type and
code line of the exception can be located in the exception stack.
[0116] The method for locating the root cause of business system failure
provided in this implementation can adapt to new business scenarios without
human intervention: after self-learning is completed, reference values such as
the time consumption comparison table are obtained, and the failure is then
located. In addition, in the process of locating the root cause of business
system failure, the failure sensitivity can be adjusted, either to provide
early warning or, by reducing the failure sensitivity, to enhance the accuracy
of the early warning. This also effectively reduces the complexity of
troubleshooting for the operation and maintenance personnel and the person in
charge of the system, saving time and labor costs.
[0117] Implementation two
[0118] As shown in Figure 3, a system for locating the root cause of business
system failure is provided by this implementation of the present invention. The
system comprises a calculation engine, a distributed publish and subscribe
message system and a network element. The calculation engine adopts Flink and
comprises a message reading module, a call-chain processing module, a network
element indicator processing module and a failure warning module. The
distributed publish and subscribe message system adopts Kafka and is configured
to store network element indicators and the call-chain messages generated when
a service interface deployed by the network element is called. The message
reading module is configured to read call-chain messages and network element
indicator messages from the distributed publish and subscribe message system.
The call-chain processing module is configured to sample the call-chain
messages according to the sampling ratio, group and assemble them to obtain the
call-chain, obtain the time consumption of each service interface in the
call-chain, and filter out the failure service interface according to the time
consumption comparison table. The network element indicator processing module
is configured to calculate the first variation range of the network element
indicator and filter out the failure network element indicator according to the
first variation range threshold value. The failure warning module is configured
to correlate the failure service interface with the failure network element
indicator according to time and issue a failure warning; the failure warning
comprises the failure service interface and the failure network element
indicator, wherein the failure network element indicator causes the failure
service interface to fail.
[0119] The system for locating the root cause of business system failure
provided by the present invention uses the method for locating the root cause
of business system failure of the above-mentioned first implementation. By
combining the call-chain log and the network element indicator data, the root
cause of business system failure can be located quickly and maintenance costs
are reduced. Compared with the prior art, the beneficial effects of the system
provided by this implementation of the present invention are the same as those
of the method provided in the first implementation, and the other technical
features of the system are the same as the features disclosed for the method in
the previous implementation, which will not be repeated here.
[0120] In the above-mentioned descriptions of implementation methods, specific
features, structures,
materials or characteristics can be combined in any one or more
implementations or examples in a suitable
way.
[0121] The above are only specific implementations of the present invention,
but the protection scope of the present invention is not limited thereto. Any
changes or substitutions that a person skilled in the art can easily think of
within the technical scope disclosed by the present invention should be covered
within the protection scope of the present invention. Therefore, the protection
scope of the present invention should be subject to the protection scope of the
claims.

Administrative Status

Title	Date
Forecasted Issue Date	Unavailable
(22) Filed	2021-11-30
(41) Open to Public Inspection	2022-05-30
Examination Requested	2022-09-16

Abandonment History

There is no abandonment history.

Maintenance Fee

Last Payment of $100.00 was received on 2023-12-15

Upcoming maintenance fee amounts

Description	Date	Amount
Next Payment if small entity fee	2025-12-01	$50.00
Next Payment if standard fee	2025-12-01	$125.00

Payment History

Fee Type	Anniversary Year	Due Date	Amount Paid	Paid Date
Application Fee		2021-11-30	$408.00	2021-11-30
Request for Examination		2025-12-01	$814.37	2022-09-16
Maintenance Fee - Application - New Act	2	2023-11-30	$100.00	2023-06-15
Maintenance Fee - Application - New Act	3	2024-12-02	$100.00	2023-12-15

Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
10353744 CANADA LTD.

Past Owners on Record
None

Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.

Documents

Document Description	Date (yyyy-mm-dd)	Number of pages	Size of Image (KB)
New Application	2021-11-30	6	221
Abstract	2021-11-30	1	26
Description	2021-11-30	22	915
Claims	2021-11-30	6	203
Drawings	2021-11-30	3	71
Representative Drawing	2022-05-10	1	14
Cover Page	2022-05-10	1	51
Request for Examination	2022-09-16	9	300
Correspondence for the PAPS	2022-12-23	4	150
Examiner Requisition	2023-12-19	5	254
Amendment	2024-04-19	27	1,043
Claims	2024-04-19	7	380
