CA 02764316 2017-01-20
Improved Systems, Methods, and Interfaces for Extending Legal
Search Results
Copyright Notice and Permission
A portion of this patent document contains material subject to copyright
protection.
The copyright owner has no objection to the facsimile reproduction by anyone
of the patent
document or the patent disclosure, as it appears in the Patent and Trademark
Office patent
files or records, but otherwise reserves all copyrights whatsoever. The
following notice
applies to this document: Copyright 2010, Thomson Reuters Global Resources.
Cross-Reference to Related Application
This non-provisional patent application claims priority to U.S. Provisional
Patent
Application Serial No. 61/217,522 filed June 1, 2009. This application also
claims priority
to, and is a continuation in part application of, U.S. Patent Application
Serial Number
11/538,749 filed October 4, 2006 entitled "Systems, Methods, and Software for
Identifying
Relevant Legal Documents" (now Publication No. U.S. 2008/0033929 A1), which
claims
priority to U.S. Provisional Patent Application No. 60/723,322 filed October
4, 2005.
Technical Field
The present invention relates to systems, methods and interfaces for providing
information in response to a computerized search for legal content.
Background
The American legal system, as well as some other legal systems around the
world,
relies heavily on a concept known as stare decisis. This Latin phrase means
"to stand by things
decided." The phrase "stare decisis" is itself an abbreviation of the Latin
phrase "stare decisis
et non quieta movere" which means "to stand by decisions and do not move that
which is quiet."
CA 02764316 2011-12-01
WO 2010/141477 PCT/US2010/036913
The American legal system practices stare decisis by deciding similar cases in
similar fashion
and not overruling previously established law absent a good reason to do so.
Legal professionals, such as paralegals, judges and lawyers, constantly apply
the
principle of stare decisis by using legal content to support their arguments
and positions. In
general, there are two types of sources of legal content, primary sources and
secondary sources.
Primary sources are judicial opinions dealing with the same legal issue.
Primary sources are said
to be "binding" if the primary source is from a higher court (or the same
exact court) than the
court currently deciding the legal issue and the higher court is in the "chain
of command" of the
court currently deciding the issue. For example, the U.S. Supreme Court's
opinions are binding
on all other courts deciding the same issue. However, a federal district
court's opinion from,
e.g., New York, is not binding upon a federal district court in Pennsylvania
deciding the same
issue. The opinion of the New York court is considered persuasive (but not
binding) primary
authority. Secondary sources comprise other legal content such as law review
and other
scholarly articles, briefs, motions, and administrative decisions.
Ideally, legal professionals would always be able to support their positions
and arguments
with primary authority that is binding. However, this is not always practical
due to many reasons
including, for example, new issues arising in the law which courts have yet to
address. Further,
parties may also have different interpretations of how or if a particular
previous judicial opinion
applies to their dispute. In fact, parties may not even agree upon what the
legal issue that needs
to be resolved is.
In order to provide clients with fast and superior legal services, many legal
professionals
use computerized research to attempt to find primary authority that is
binding. This helps control
legal costs while making legal professionals choosing to use computerized
research more
efficient.
However, even with computerized research, legal professionals cannot be fully
aware of
all potential legal issues due to the vast amount of legal information legal
professionals may have
to review. For example, West Publishing Company of St. Paul, Minnesota (doing
business as
Thomson West) (hereinafter "West") offers the ability for legal professionals
to conduct
computerized research on over 100 million documents. West collects legal
content from various
sources and makes them available electronically through its Westlaw
information-retrieval
system. (Westlaw is a trademark of West). Searchable documents include
documents from
both primary and secondary sources. Further, the West Key Number™ System,
which
provides classified summaries of legal points made in judicial opinions, is
also searchable
(West Key Number™ is a trademark of Thomson West). The summaries, known as
Headnotes, are classified into more than 90,000 distinct legal categories, and
can be used for
a variety of purposes, such as evaluating the relevance of legal opinions to
particular legal
issues. Secondary resources, such as American Law Reports (ALR®), include
about 4,000
in-depth scholarly articles, each teaching about a separate legal issue.
Multiple systems used by legal professionals, including Westlaw, have
addressed the
ability of a legal professional performing a search to quickly become familiar
with other
potentially relevant legal issues. For example, Westlaw currently provides
legal professionals
with a feature known as ResultsPlus®. This feature is described in U.S. Patent
Application
Serial No. 11/028,476 filed January 3, 2005, entitled "Systems, Methods,
Interfaces and
Software for Extending Search Results Beyond Initial Query-Defined
Boundaries."
Essentially, this feature provides the legal professional with a screen with
links to: (1) a first
set of documents responsive to their search query; and (2) a second set of
documents
including documents outside the boundaries of the original search query. The
feature does so
in a manner that causes the first set of documents and the second set of
documents to be
displayed separately within a single graphical user interface ("GUI").
There exists a need to further improve the searching of legal documents by
legal
professionals.
Summary of the Invention
We have recognized that legal professionals are more efficient in conducting
legal
research if the second set of documents which is outside the boundaries of the
original search
query is organized into clusters. More specifically, we have invented a method
comprising
receiving a first signal indicative of a selection of a legal document
associated with a set of
metadata; based upon the set of metadata, picking a first cluster of legal
documents and a
second cluster of legal documents, the first cluster of legal documents being
associated with a
first legal topic and the second cluster of legal documents being associated
with a second legal
topic; and transmitting a second signal relating to the first cluster of legal
documents and the
second cluster of legal documents.
Advantageously, the present invention permits legal professionals to conduct
legal
research more effectively by providing the legal professional with information
regarding clusters
of documents associated with a document the legal professional has viewed
and/or accessed in
some fashion.
In accordance with an aspect of the present invention there is provided a
method
comprising:
receiving a search request signal, the search request signal comprising a
search request
for a set of legal information;
identifying a set of legal documents in response to the search request signal;
transmitting
a search response signal associated with the set of legal documents;
receiving a selection signal, the selection signal indicative of a selection
of a given legal
document from the set of legal documents, the legal document being associated
with a set of
metadata; based upon the set of metadata, picking a first cluster of legal
documents and a second
cluster of legal documents, the first cluster of legal documents being a first
set of documents
comprising one or more primary resources and one or more secondary resources
grouped
according to a first legal topic which the first set of documents hold in
common and the second
cluster of legal documents being a second set of documents comprising one or
more primary
resources and one or more secondary resources grouped according to a second
legal topic which
the second set of legal documents hold in common,
wherein the picking of the first cluster of legal documents and the second
cluster of legal
documents is based upon, respectively:
a first similarity score between the legal document and the first set of
documents
belonging to the first cluster of legal documents, wherein the first
similarity score is
based upon a measure of associated citation information between the legal
document and
the first set of documents, wherein the associated citation information
comprises editorial
annotations, hierarchical categorization of legal topics, and relationship
between legal
topics;
a second similarity score between the legal document and the second set of
documents belonging to the second cluster of legal documents, wherein the
second
similarity score is based upon a measure of associated citation information
between the
legal document and the second set of documents, wherein the associated
citation
information comprises editorial annotations, hierarchical categorization of
legal topics,
and relationship between legal topics; and
transmitting a second signal relating to the first cluster of legal documents
and the
second cluster of legal documents.
In accordance with a further aspect of the present invention there is provided
a computing
device comprising:
a processor;
a memory operatively coupled to the processor, the memory storing instructions
that cause
the processor to:
receive a search request signal, the search request signal
comprising a search request for a set of legal information; identify a set of
legal documents
in response to the search request signal, the set of legal documents
comprising the legal document;
transmit a signal associated with the set of legal documents;
receive a selection signal, the selection signal indicative of a selection of
a given legal
document from the set of legal documents, the legal document being associated
with a set of
metadata;
based upon the set of metadata, pick a first cluster of legal documents and a
second cluster
of legal documents, the first cluster of legal documents being a first set of
documents comprising
one or more primary resources and one or more secondary resources grouped
according to a first
legal topic which the first set of documents hold in common and the second
cluster of legal
documents being a second set of documents comprising one or more primary
resources and one or
more secondary resources grouped according to a second legal topic which the
second set of legal
documents hold in common,
wherein picking the first cluster of legal documents and the second cluster of
legal
documents is based upon, respectively:
a first similarity score between the legal document and a first set of
documents
belonging to the first cluster of legal documents, wherein the first
similarity score is based
upon a measure of associated citation information between the legal document
and the first
set of documents, wherein the associated citation information comprises
editorial
annotations, hierarchical categorization of legal topics, and relationship
between legal
topics; and
a second similarity score between the legal document and a second set of
documents belonging to the second cluster of legal documents, wherein the
second
similarity score is based upon a measure of associated citation information
between the
legal document and the second set of documents, wherein the associated
citation
information comprises editorial annotations, hierarchical categorization of
legal topics, and
relationship between legal topics; and
transmit a second signal relating to the first cluster of legal documents and
the
second cluster of legal documents.
Other advantages will be apparent to those skilled in the art based upon the
remainder of
the specification, including the figures.
Brief Description of the Figures
Figure 1 is a diagram of a system corresponding to one embodiment of the
invention;
Figure 2 is a flowchart corresponding to the operation of the system of Figure
1;
Figures 3A through 3I are screenshots of what a user may see when executing
methods
in accordance with the invention; and
Figures 4A through 4L relate to the generation of clusters of legal documents.
Detailed Description
Background
In addition to providing this background section, this detailed description
will describe a
system in which the invention may be implemented, including the system's
components and
structure. Next, the detailed description will describe the operation of the
system, including a
legal professional's interactions with the system and resulting displays
associated with clusters of
legal documents. Next, the detailed description will describe how clusters of
legal documents
are originally generated. Finally, the detailed description will describe how
these clusters may
be used to provide the user with additional relevant documents or additional
relevant clusters.
As used herein, "topic" and/or "legal topic" shall mean a legal area, issue,
and/or subject
matter. An example of this is "search and seizure." A sub-topic and/or "legal
sub-topic" shall
mean a more granular classification of a topic and/or legal topic. Examples of
this are "search
and seizure - traffic stop" and "search and seizure - expectation of privacy."
A cluster shall
mean a set of documents grouped according to a topic that the documents hold
in common. An
example of a cluster is a group of legal documents relating to "search and
seizure." Although
typically a heterogeneous (containing more than one content type of document)
set of
documents, it is possible for a cluster to be a homogeneous set of documents
(i.e., containing
only one content type of document). A noun phrase is a word group that
contains a noun and its
modifiers. Examples of noun phrases are "product liability action" and "our
favorite restaurant."
A segment is a portion of a document that may be defined by the particular
topic it addresses.
By way of example, a court decision discussing and finding a party liable for
fraud and then
discussing damages is one document with two segments, namely "fraud" and
"damages." The
words pick, choose, select, identify, and all respective forms thereof, shall
be used
interchangeably.
Also, a document is "associated" with a cluster if it is relevant to the topic
of the cluster.
Further, a document is a "member" of a cluster if it is both relevant to the
topic associated with a
cluster and is important in the context of the topic. Still further, a first
document is said to be
"similar" to a second document if they share a sufficient number of features
such as noun phrases
and citation history.
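
By way of a non-limiting illustration, the definitions above may be sketched as simple data structures. The class names, field names, and the `min_shared` threshold below are assumptions introduced only for this example; the specification does not fix what counts as a "sufficient number" of shared features.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    doc_id: str
    content_type: str                      # e.g. "case", "statute", "law_review"
    noun_phrases: frozenset = frozenset()  # noun phrases extracted from the text
    citations: frozenset = frozenset()     # citation history, as citation strings


@dataclass
class Cluster:
    topic: str
    associated: list = field(default_factory=list)  # documents relevant to the topic
    members: list = field(default_factory=list)     # relevant AND important documents

    def is_homogeneous(self) -> bool:
        # A cluster is homogeneous when its members share one content type.
        return len({d.content_type for d in self.members}) <= 1


def is_similar(a: Document, b: Document, min_shared: int = 2) -> bool:
    # Hypothetical rule: count shared noun phrases plus shared citations
    # and compare against an assumed threshold.
    shared = len(a.noun_phrases & b.noun_phrases) + len(a.citations & b.citations)
    return shared >= min_shared
```

In this sketch a cluster holding a judicial opinion and a law review article would be heterogeneous, which is the typical case described above.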
Finally, it should be noted that there are many different types of legal
documents
including but not limited to case law, statutes, regulations, administrative
decisions, secondary
sources, briefs, pleadings, motions, memoranda, expert witness testimony,
court orders, scholarly
articles, and jury verdicts. Further, these documents arise in the federal,
state and/or local
context (e.g., a federal court opinion as opposed to a state court opinion).
Also, at least some of
these types of documents (e.g., non-court decision documents) may be
associated with notes of
decisions which serve as alerts to the legal professional accessing the
documents that the
document (or, e.g., the contents of the document such as a statute) has been
involved in
litigation. Some of these documents are primary authority and some are
secondary authority.
System Components and Structure
Figure 1 shows an exemplary online information-retrieval system 100. System
100 may
include one or more databases 110, one or more servers 120 (only one shown),
and one or more
access devices 130 (only one shown).
Databases 110 include a set of primary databases 112 and a set of secondary databases 114.
databases 114.
Primary databases 112, in the exemplary embodiment, include a case law
database 1121 and a
statutes database 1122, which respectively include judicial opinions and statutes from one or
statutes from one or
more local, state, federal, and/or international jurisdictions. Secondary
databases 114 include an
ALR® database 1141, an AMJUR® database 1142, a West Key Number™ (KNUM)
Classification database 1143, and a law review (LREV) database 1144. Other
databases (not
shown) may include financial, tax, scientific, and/or health-care information.
Also, it should be
noted that primary and secondary may also connote the order of presentation of
search results
and not necessarily the authority or credibility of the search results.
Databases 110, which take the exemplary form of one or more electronic,
magnetic, or
optical data-storage devices, include or are otherwise associated with
respective indices (not
shown). Each of the indices includes terms and phrases in association with
corresponding
document addresses, identifiers, and other conventional information. Databases
110 are coupled
or couplable via a wireless or wireline communications network, such as a
local-, wide-, private-,
or virtual-private network, to server 120.
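
For illustration only, an index that stores terms in association with document identifiers may be sketched as an inverted index. The function names and the whitespace tokenization below are assumptions made for this example and are not taken from the described system.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of document identifiers containing it.

    `documents` is a dict of document identifier -> text.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return identifiers of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A query is answered by intersecting the identifier sets of its terms, so the documents themselves need not be scanned at search time.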
Server 120 is generally representative of one or more servers for serving data
in the form
of web pages or other markup language forms. This may be done with known
associated applets,
ActiveX controls, remote-invocation objects, or other related software and
data structures to
service clients of various "thicknesses." More particularly, server 120
includes a processor
module 121 and a memory module 122.
Processor module 121 includes one or more local or distributed processors,
controllers, or
virtual machines. In the exemplary embodiment, processor module 121 assumes
any convenient
or desirable form.
Memory module 122 takes the exemplary form of one or more electronic,
magnetic, or
optical data-storage devices. Memory module 122 is comprised of a subscriber
database 123, a
search module 124, a user-interface module 126, and a cluster module 128.
Subscriber database 123 includes subscriber-related data for controlling,
administering,
and managing pay-as-you-go or subscription-based access of databases 110.
Search module 124
includes one or more search engines and related user-interface components.
These search
engines receive and process user queries and/or other user activity against
one or more of
databases 110, including the primary databases 112 and the secondary databases
114. The
secondary databases may provide, for example, topical treatises, state
practice guides, statutes,
and/or law review articles to augment searches of the case law database.
User-interface module 126
includes machine readable and/or executable instruction sets for wholly or
partly defining web-based
user interfaces, such as search interface 1261 and results interface
1262, over a
communications link 129 such as a wireless or wireline communications network
on one or more
access devices, such as access device 130.
Cluster module 128 includes machine readable and/or executable instruction
sets. Cluster
module 128 interacts, directly and/or indirectly, with the processor 121 and
other modules in the
memory 122. Cluster module 128 also interacts, directly and/or indirectly,
with the databases
110 via communications links 111 and with access device 130 via communications
link 129.
Access device 130 is generally representative of one or more access devices,
all of which
may simultaneously interact with the server 120. In the exemplary embodiment,
access device
130 takes the form of a personal computer, workstation, personal digital
assistant, mobile
telephone, or any other device capable of providing an effective user
interface with a server or
database. Specifically, access device 130 includes a processor module 131 (one
or more
processors or processing circuits), a memory 132, a display 133, a
keyboard 134, and a
graphical pointer or selector 135, such as a "mouse."
Processor module 131 includes one or more processors, processing circuits, or
controllers. In the exemplary embodiment, processor module 131 takes any
convenient or
desirable form. Coupled to processor module 131 is memory 132.
Memory 132 stores code (machine-readable or executable instructions) for an
operating
system 136, a browser 137, and a GUI 138. In the exemplary embodiment,
operating system 136
takes the form of a version of the Microsoft Windows operating system, and
browser 137
takes the form of a version of Microsoft Internet Explorer. Operating system
136 and
browser 137 not only receive inputs from keyboard 134 and selector 135, but
also support
rendering of GUI 138 on display 133. Upon rendering, GUI 138 presents data in
association
with one or more interactive control features (or user-interface elements).
(The exemplary
embodiment defines one or more portions of interface 138 using applets or
other programmatic
objects or structures from server 120.)
More specifically, graphical user interface 138 defines or provides one or
more display
regions, such as a query or search region 1381 and a search-results region
1382. Query region
1381 is defined in memory and upon rendering includes one or more interactive
control features
(elements or widgets), such as a query input region 1381A and a query
submission button
1381B. Search-results region 1382 is also defined in memory and upon rendering
includes a first
region 1382A, a second region 1382B, and a third region 1382C. Region 1382A
includes one or
more interactive control features, such as features A1, A2, A3 for accessing
or retrieving one or
more corresponding search result documents from one or more of databases 110
via server 120.
Region 1382A, in one embodiment, is the region from which a legal professional
may select a
legal document. Regions 1382B and 1382C are, respectively, regions for
displaying information
relating to the first cluster of legal documents and the second cluster of
legal documents. Such
information may include respective titles and/or citations for the
corresponding documents. For
each such document and/or cluster, this information may be in the form of a
hyperlink or other
browser-compatible command input that provides access, ultimately, to the
documents and/or
cluster of documents via server 120 and databases 110.
System Operation
Figure 2 is a flowchart 200 corresponding to operation of the system 100 of
Figure 1.
Flowchart 200 includes blocks 210 through 270 which are arranged and generally
described
sequentially. However, those skilled in the art realize that other embodiments
of the invention
may execute two or more blocks in parallel using multiple processors or
processor-like devices
or a single processor organized as two or more virtual machines or sub
processors. Some
embodiments also alter the process sequence or provide different functional
partitions to achieve
analogous results. For example, some embodiments may alter the client-server
allocation of
functions, such that functions shown and described on the server side are
implemented in whole
or in part on the client side, and vice versa. Moreover, still other
embodiments implement the
blocks as two or more interconnected hardware modules with related control and
data signals
communicated between and through the modules. Thus, the exemplary flowchart of
Figure 2
(and elsewhere in this description) applies to software, hardware, and/or
firmware
implementations.
The remaining description in the System Operation section refers to Figures 2
through 3I,
wherein Figure 2 outlines the operation of the system 100 and Figures 3A
through 3I are various
screenshots as seen from the perspective of a user (e.g., legal professional)
using an access device
130 to access the WestlawNext™ online information retrieval system.
As shown in block 210, the system 100 generates a signal that ultimately
causes a search
interface to be presented to a user. The signal is output from server 120 to
access device 130 via
communications link 129 and stored in memory 132. GUI 138 provides search
region 1381 on
the access device 130. It should be noted that this step assumes that the user
operating access
device 130 has already successfully logged into the system 100 by supplying an
internet-protocol
(IP) address for an online information-retrieval system and correct login
information (e.g., user
identification and password), via the access device 130 and communications
link 129, to the
system 100. An exemplary search interface screen 300 presented to the user is
depicted in
Figure 3A. The search interface 300 includes a query input region 310 in which
the user of
access device 130 may enter a search query by typing text and submitting the
query to system
100.
As shown in block 220, the system 100 receives the query, also known as a
search
request, and processes the request. To process the request, the server 120
communicates with at
least one database from databases 110 and identifies a set of legal documents
in response to the
search request. Next, the server 120, via the processor 121 and memory 122,
generates a signal
associated with the set of legal documents identified in response to the
search request. The
signal is transmitted over communications link 129 to access device 130. The
access device 130
displays a screen 320 to the user based upon this signal. Such a screen 320 is
depicted in Figure
3B. It should be noted that Figure 3B does not contain information (e.g.,
titles, words describing,
hyperlinks to, etc.) relating to a first cluster of legal documents and a
second cluster of legal
documents.
As shown in block 230, the system 100 receives another signal generated by the
user of
access device 130 via communications link 129. This signal is indicative of
the user accessing a
document from the set of legal documents provided in response to the search
request. Accessing
may be done in a variety of manners including but not limited to the user: (1)
viewing the
document on the access device 130; (2) printing the document; (3) emailing the
document; and
(4) setting up an alert with respect to the document. As shown in block 240, the
processor 121 and
memory 122 begin to process this signal. This is done by identifying a set of
metadata
associated with the accessed or selected document. This set of metadata is
then used to pick a
first cluster of legal documents and a second cluster of legal documents as
shown in block 250.
The manner in which clusters are picked is by using a pre-computed set of
clusters associated
with each document. The association process, described in more detail in the
Cluster Generation
section below, uses a combination of similarity measures between the document
and/or
document metadata and the cluster and/or cluster metadata. These measures
include statistics
(such as term-frequency and inverse document-frequency) regarding terms, noun
phrases, word
pairs, text, citations, associated queries, and other items. As shown in block
260, a signal relating
to these clusters is generated and transmitted from server 120 to access
device 130 via
communications link 129. Next, the access device 130 displays a screen 330 to
the user based
upon this signal. Such a screen 330 is depicted in Figure 3C. It should be
noted that the right-hand
portion 331 of screen 330 is related to the clusters and is analogous to
regions 1382B and 1382C of Figure 1.
Also, portion
332 of screen 330 is analogous to region 1382A of Figure 1.
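
As a non-authoritative sketch of blocks 240 and 250, the following assumes that each document and each cluster is represented by a list of terms, that similarity is the cosine of smoothed TF-IDF vectors, and that the top `k` clusters per document are stored ahead of time. These choices, and all function names, are assumptions for illustration; the described system combines several measures (noun phrases, word pairs, citations, associated queries) that are not modeled here.

```python
import math
from collections import Counter


def tfidf_vectors(term_lists):
    """Build a smoothed TF-IDF vector for each named list of terms."""
    n = len(term_lists)
    df = Counter()
    for terms in term_lists.values():
        df.update(set(terms))
    vectors = {}
    for name, terms in term_lists.items():
        tf = Counter(terms)
        vectors[name] = {
            t: count * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def precompute_associations(doc_terms, cluster_terms, k=2):
    """Offline step: store the k most similar clusters for every document."""
    combined = {("doc", d): t for d, t in doc_terms.items()}
    combined.update({("cluster", c): t for c, t in cluster_terms.items()})
    vecs = tfidf_vectors(combined)
    table = {}
    for d in doc_terms:
        scored = sorted(
            ((cosine(vecs[("doc", d)], vecs[("cluster", c)]), c) for c in cluster_terms),
            reverse=True,
        )
        table[d] = [c for score, c in scored[:k] if score > 0]
    return table


def pick_clusters(table, selected_doc_id):
    """Online step: look up the pre-computed clusters for the selected document."""
    return table.get(selected_doc_id, [])
```

Because the association table is built offline, the work done when a selection signal arrives reduces to a single lookup.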
At this point, the user, who had originally searched for "federal arbitration
act" (see query
input region 310 of Figure 3A), realizes that what is more interesting to the
user is a set of
documents relating to the topic entitled "Alternative Dispute Resolution" (see
generally the right
hand portion 331 of Figure 3C showing multiple clusters). More specifically,
the user wants
more information on the sub-topic entitled "Interstate Commerce Requirement of
[the] Federal
Arbitration Act." When the user clicks on the appropriate hyperlink relating
to the sub-topic, a
signal is sent from the access device 130 to the server 120 via the
communications link 129. As
shown in block 270, the server 120 receives and processes this signal by
identifying legal
documents associated with the topic and sub-topic "Alternative Dispute
Resolution/Interstate
Commerce Requirement of [the] Federal Arbitration Act." To process the signal,
the server 120
communicates with at least one database from databases 110 and identifies
legal documents
relevant to the sub-topic (based upon clusters and "sub-clusters"). Next, the
server 120, via the
processor 121 and memory 122, generates a signal associated with the legal
documents and
transmits it over communications link 129 to access device 130. The access
device 130 displays
a screen 340 to the user. Such a screen 340 is depicted in Figure 3D.
Figures 3E through 3I show another series of screenshots relating to the
invention.
Essentially, they illustrate another scenario under which signals relating
to multiple clusters
may be transmitted to an access device. The transmission does not have to be initiated
solely in response to a
"word" or "text" search (as shown in input region 310 of Figure 3A). For
example, Figure 3E
begins with a user searching for a document associated with a particular
citation, namely 489
U.S. 468, a citation to a Supreme Court case.
Cluster Generation
Figures 4A through 4J disclose various algorithms, features and applications
for
generating and using clusters of legal documents. As discussed in detail
below, in one
embodiment, the cluster module 128 of Figure 1 defines and generates a cluster
by identifying
one or more legal issues among case-law documents, populates the cluster with
a rich spectrum
of legal documents based upon the cluster's legal issue, summarizes the
content represented by
the generated cluster, and provides various associations between generated
clusters and
documents, queries, and folders. Although the description below refers to a
Westlaw system
environment, one skilled in the art will appreciate that the disclosed
algorithms, features and
applications are applicable to other online legal research systems.
To identify one or more legal issues among case-law documents, the cluster
module 128
implements a bottom-up strategy. For example, in one embodiment, the cluster
module 128
identifies the legal issues inside one document, and then merges similar
issues together to form
clusters for all documents.
The cluster module 128 identifies legal issues using a Headnotes grouping
defined for a
case. For example, for cases deemed important on the Westlaw system,
Headnotes (e.g.,
editorial annotations) are added during the publishing process. Headnotes
provide a succinct
summary of a legal issue raised in the case and are also associated with one
or more Westlaw
Key Numbers™, described below. An example of a Headnotes grouping with Key
Numbers™
is shown in Figure 4A.
Advantageously, by grouping Headnotes based on their "similarities", the
cluster
module 128 identifies major legal issues inside a case. In one embodiment, to
determine
similarity, the cluster module 128 first computes several features from the
Headnotes and then
applies an agglomerative clustering algorithm. Exemplary similarity features
computed by the cluster module 128 include a Key Numbers™ similarity feature,
a Headnote text similarity feature, a KeyCite® similarity feature, and a
Common Noun Phrase frequency feature.
The Key Numbers™ similarity feature is based on a Key Number™. West's Key
Number System® is a taxonomy defined on the Westlaw system that categorizes
legal topics into a hierarchical structure. The cluster module 128 computes
the similarity between Key Numbers™ based on the global co-existence of Key
Numbers™ inside cases. In one embodiment, the cluster module 128 determines
Key Number™ topic commonality.
The Headnote text similarity feature is based on text describing a legal
issue. For
example, in the Westlaw system, each Headnote typically includes an amount of
text
describing a legal issue. The cluster module 128 computes the similarity
between two
Headnotes' text using word-pair features extracted from them. In one
embodiment, the cluster module 128 uses a hybrid approach which combines the
TF-IDF (term frequency-inverse document frequency) scores and probabilities
of word pairs.
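As an illustration of the Headnote text similarity feature, the following minimal sketch scores two Headnotes by the cosine of their word-pair TF-IDF vectors. The tokenizer, the use of unordered word pairs counted once per Headnote, the smoothed IDF, and all function names are assumptions made for illustration only; the probability component of the hybrid approach is not modeled.

```python
import math
import re
from collections import Counter
from itertools import combinations


def word_pairs(text):
    """Return unordered pairs of distinct words in one text (each pair counted once)."""
    words = sorted(set(re.findall(r"[a-z]+", text.lower())))
    return Counter(combinations(words, 2))


def tfidf_vectors(texts):
    """Build one sparse word-pair TF-IDF vector (a dict) per text."""
    pair_counts = [word_pairs(t) for t in texts]
    doc_freq = Counter(p for counts in pair_counts for p in counts)
    n_docs = len(texts)
    vectors = []
    for counts in pair_counts:
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors.append({p: tf * (math.log((1 + n_docs) / (1 + doc_freq[p])) + 1.0)
                        for p, tf in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two Headnotes that share several word pairs score above zero, while Headnotes with no words in common score exactly zero.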
The KeyCite® similarity feature is based on relationships between cases. In
the Westlaw system, KeyCite® data maintains citing and cited relationships
between cases (several down to the Headnotes level). In addition, KeyCite®
data includes information concerning the importance/authoritativeness of a
case, and information regarding similarity among Headnotes (for example, if
two or more Headnotes are co-cited together in several cases, they tend to
discuss closely related legal issues). U.S. Patent Number 7,529,756 issued
on May 5, 2009 entitled "System and Method for Processing Formatted Text
Documents in a Database" (filed November 22, 2000 and assigned U.S. Pat.
Application Serial No. 09/746,557) and U.S. Pat. Application Serial No.
12/432,380 entitled "System and Method for Processing Formatted Text
Documents in a Database" filed on April 29, 2009 describe KeyCite® in detail.
The cluster module 128 computes the frequency with which Headnotes have been
co-cited in other cases.
The Common Noun Phrase frequency feature is based on a noun phrase (NP) whose
head
is a noun or a pronoun, optionally accompanied by a set of modifiers. In the
Westlaw system,
NPs typically represent a legal term in a Headnote. The cluster module 128
computes the
frequency of two common NPs between Headnotes, which provides a measure of how
similar
Headnotes are at the "concept" level. In one embodiment, the cluster module
128 uses the NP frequency feature as a supplement to the Headnote text
similarity feature, since an NP may be considered an n-gram for a particular
value of n.
Once the cluster module 128 computes one or more similarity features between
Headnotes, the cluster module 128 implements an agglomerative clustering
algorithm to group similar Headnotes. For example, in one embodiment, the
cluster module 128 merges two Headnotes together while maximizing the
following equations,

$$\mathcal{H}_2 = \text{maximize}\ \frac{\mathcal{I}_2}{\mathcal{E}_1}$$

where

$$\mathcal{I}_2 = \text{maximize} \sum_{r=1}^{k} \sum_{h_i \in S_r} \cos(h_i, C_r)$$

$$\mathcal{E}_1 = \text{minimize} \sum_{r=1}^{k} n_r \cos(C_r, C)$$

$$C_r = \frac{\sum_{h_i \in S_r} h_i}{n_r}$$

$$C = \frac{\sum_{S_r \in S} \sum_{h_i \in S_r} h_i}{\sum_{S_r \in S} n_r}$$

in which $\mathcal{I}_2$ is the intra-cluster similarity and $\mathcal{E}_1$
is the inter-cluster similarity. In these equations,
k being the total number of clusters, S_r being one of the k clusters, and S
being the collection of all the clusters,
h_i being one of the Headnotes in the cluster S_r,
C_r being the center of one cluster,
C being the center of all the clusters,
n_r being the number of Headnotes in the cluster S_r.
In one embodiment, the cluster module 128 scans through all the Headnote
feature vectors, which is one common representation for a set of features
used, and identifies two feature vectors which have the maximal
$\mathcal{I}_2$ value. The cluster module 128 also computes the value
$\mathcal{E}_1$ at approximately the same time. The cluster module 128 stops
the scanning iteration when the value of $\mathcal{E}_1$ is less than a
predefined threshold. The threshold ranges between 0.0 and 1.0 and is
preferably set to 0.45.
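The threshold-based stopping behavior can be illustrated with a minimal sketch. It is deliberately simplified: instead of evaluating the criterion functions given above, it repeatedly merges the two groups whose centroids have the highest cosine similarity and stops once no pair reaches the threshold. The function names, the dense list representation of feature vectors, and the centroid-based merge rule are assumptions made for illustration; only the default threshold of 0.45 comes from the description.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def agglomerate(headnotes, threshold=0.45):
    """Group feature vectors bottom-up; returns lists of input indices."""
    clusters = [[i] for i in range(len(headnotes))]
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([headnotes[i] for i in clusters[a]])
                cb = centroid([headnotes[i] for i in clusters[b]])
                sim = cosine(ca, cb)
                if sim > best:
                    best, best_pair = sim, (a, b)
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Because merging stops at the threshold, the number of resulting groups is determined by the data rather than being fixed in advance.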
Advantageously, by utilizing a predefined threshold, the cluster module 128
avoids
setting up the number of clusters for the data set in advance, which many of
the known clustering
algorithms require. The cluster module 128 applies this technique to cases
with Headnotes and
resulting topics are used in a cluster merging process described below which
produces clusters
for cases.
Once topics are determined, the cluster module 128 is configured to merge
similar clusters. For example, legal topics detected in different cases using
the aforementioned techniques may be very similar, i.e., they are concerned
with the same or closely related legal issues. By merging similar clusters
together, the cluster module 128 partitions the legal space into meaningful
clusters.
In one embodiment, the cluster module 128 merges clusters using a two-step
process. First, the cluster module 128 performs a candidate selection process.
The candidate selection process includes generating, training and applying
three different CaRE indices to eligible topics. CaRE stands for
Classification and Recommendation Engine. CaRE is described in detail in
U.S. Patent No. 7,062,498 which issued
on June 13,
2006 entitled "Systems, Methods, and Software for Classifying Text from
Judicial Opinions and
other Documents" (filed on December 21, 2001 and assigned U.S. Pat.
Application Serial No.
10/027,914), U.S. Patent Number 7,580,939 which issued on August 25, 2009
entitled "Systems,
Methods, and Software for Classifying Text from Judicial Opinions and other
Documents" (filed
on August 30, 2005 and assigned U.S. Pat. Application Serial No. 11/215,715),
and U.S. Pat.
Application Serial No. 12/545,642 entitled "Systems, Methods, and Software for
Classifying
Text from Judicial Opinions and other Documents" filed on August 21, 2009.
In one embodiment, the cluster module 128 performs the following indexing
functions: CaRE word-pair indexing, CaRE Key Numbers™ indexing, and CaRE
citation indexing.
In CaRE word-pair indexing, the cluster module 128 associates each topic with
a
number of Headnote texts. The cluster module 128 computes word-pairs of the
text and indexes
them. The cluster module 128 retrieves a list of topics based on the
similarities between word-
pair profiles.
In CaRE Key Numbers™ indexing, the cluster module 128 associates each topic
with a list of Key Numbers™ via Headnotes. The cluster module 128 then
computes indexed Key Number™ profiles. The cluster module 128 then retrieves
a list of topics based on the commonalities between Key Number™ profiles.
In CaRE citation indexing, the cluster module 128 links each topic to one or
more cases, and each case is further linked to other cases via KeyCite®
information (containing both citing and cited information). The cluster
module 128 also computes citation profiles that are indexed. The cluster
module 128 then retrieves a list of topics based on common citation patterns
between citation profiles.
Advantageously, by aggregating the recommendations from the three generated
CaRE
indices, the cluster module 128 generates a list of candidates for each of the
topics.
Second, from the list of candidates generated from the selection process, the
cluster module 128 determines for each cluster whether the cluster is
"similar" to an input topic, and thus should be merged with the topic. In one
embodiment, for each topic identified, the cluster module 128 generates a
query during the Headnotes grouping phase described previously. The query can
include noun phrases and Key Numbers™. An example is shown in connection with
Fig. 4B. From the query, along with the associated cases, the cluster module
128 determines several features. Exemplary features calculated by the cluster
module 128 include:
Noun Phrases (NPs) similarity, which includes a global maximal score between
pair-wise NPs, mean of maximal score between pair-wise NPs, percentage of
common NPs, and percentage of common words;
Key Numbers™ (KNs) similarity, which includes a Key Number™ profiles
similarity score, percentage of common KNs, and percentage of common KN
topics;
Co-citation feature, which describes the normalized number of documents cited
by both associated seed cases; and
Co-click feature, which calculates the normalized number of sessions that
have both associated seed cases.
The Co-citation feature is computed using the following formula:
$$\mathrm{cite\_sim}(c_i, c_j) = \frac{\mathrm{cite}(c_i \cap c_j)}{\mathrm{cite}(c_i \cup c_j)}$$
in which cite(c_i ∩ c_j) is the count of other legal documents citing both
seed cases c_i and c_j. Also, cite(c_i ∪ c_j) is the count of legal documents
citing either seed case c_i or c_j.
The co-click feature calculates the normalized number of sessions that have
both associated seed cases and is computed using the following formula:
$$\mathrm{coclick\_sim}(c_i, c_j) = \frac{\mathrm{click}(c_i \cap c_j)}{\mathrm{click}(c_i \cup c_j)}$$
in which click(c_i ∩ c_j) is the count of sessions in which both seed cases
c_i and c_j were clicked. Also, click(c_i ∪ c_j) is the count of sessions in
which either seed case c_i or c_j was clicked.
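Both the co-citation and the co-click features are ratios of an intersection count to a union count, which can be sketched with Python sets. The data layout (a mapping from each seed case to the set of documents citing it, or to the set of sessions in which it was clicked) and the function names are assumptions made for illustration.

```python
def overlap_ratio(items_a, items_b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cite_sim(citing_docs, case_i, case_j):
    """citing_docs maps a seed case to the documents that cite it."""
    return overlap_ratio(citing_docs.get(case_i, ()), citing_docs.get(case_j, ()))


def coclick_sim(click_sessions, case_i, case_j):
    """click_sessions maps a seed case to the sessions in which it was clicked."""
    return overlap_ratio(click_sessions.get(case_i, ()), click_sessions.get(case_j, ()))
```

For instance, if two seed cases are cited by three documents each and two of those documents cite both, the union holds four documents and the co-citation feature is 2/4.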
In one embodiment, the cluster module 128 uses these generated features to
train a support vector machine ("SVM") ranker model. SVMs and ranking are
well known in the art. In order to provide target data for the training of
the model, the cluster module 128 generates a set of "silver" preference
grades automatically by measuring the overlaps between retrieved cases using
the queries associated with the clusters through a search engine process. The
search engine is described in detail in U.S. Patent Application No.
11/538,749 filed October 4, 2006 entitled "Systems, Methods, and Software for
Identifying Relevant Legal Documents" (now Publication No. U.S. 2008/0033929
A1). By ranking the scores of the candidates using the features via the SVM
model, the cluster module 128 generates a cluster by merging selected
candidates with a seeding topic based on the ranked scores. A list of
clusters can then be produced by exhaustively repeating this process for each
of the topics such that one is either merged with other topics or becomes a
seeding topic.
Once the list of clusters is selected to be merged, the cluster module 128
generates
labels. A label displays the "aboutness" of a cluster and reflects a summary
of the content inside
the cluster. The content of a populated cluster can include cases, statutes,
regulations,
administrative decisions, analytic materials, briefs, expert witness
testimony, jury verdict reports,
state trial court orders, pleadings, motions and memoranda as well as other
legal documents.
Furthermore, the cases and some of the other documents will also include
Headnote texts and Key Numbers™. The catchline of a Key Number™ is a short
description of a defined legal topic, and it is hierarchically structured
such that the first portion is often referred to as the Key Number™ topic,
such as "Negligence" in Figure 4C, and subsequent portions are often referred
to as Key Number™ sub-topics, while the last portion is often referred to as
the leaf level.
In one embodiment, the cluster module 128 generates a hierarchical label
structure that
includes a topic, optional sub-topic, and a noun phrase from cases. The topic
and sub-topic parts
are derived from Key Number™ catchlines, which are precise and hierarchically
structured phrases describing various legal issues. The noun phrase is
selected from Headnote texts inside a cluster. Examples of cluster labels are
shown below, wherein the bold portions represent the topic and sub-topic, and
the italic portion is the NP.
Securities regulation/state regulation/investment contract security
Schools/Teachers' Duties and Liabilities/Governmental Immunity
A cluster contains a certain number of Key Numbers™, typically those assigned
to the Headnotes contained in the cluster. To generate the topic and
sub-topic portion of the label, the cluster module 128 computes a frequency
of the Key Numbers™, from which the major topics included in the cluster are
determined. Once a major topic has been identified, the cluster module 128
traverses the catchlines among Key Numbers™ in the major topic to determine a
sub-topic. In one embodiment, the cluster module 128 traverses the catchlines
until a divergence is detected based on a majority voting scheme. An example
of label generation for topics is shown in connection with Figure 4C wherein
the label is shown in box 410. An example of a majority voting scheme is one
where the top n post-divergence sub-topics are considered (where n might be,
for example, 7) and which selects the sub-topic that is the most frequently
occurring within the candidate set.
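One possible reading of the catchline traversal can be sketched as follows: walk down the catchline levels for as long as a strict majority of the remaining Key Numbers agree on the next portion, and stop at the first level where they diverge. The representation of a catchline as a list of portions, the strict-majority rule, and the function name are assumptions made for illustration; the top-n voting variant described above is not modeled.

```python
from collections import Counter


def label_prefix(catchlines):
    """Walk catchline levels while a strict majority of entries agree.

    catchlines: list of catchlines, each a list of portions from topic to leaf.
    Returns the agreed topic/sub-topic portions.
    """
    prefix = []
    remaining = [c for c in catchlines if c]
    depth = 0
    while remaining:
        votes = Counter(c[depth] for c in remaining if len(c) > depth)
        if not votes:
            break  # every remaining catchline has been fully consumed
        portion, count = votes.most_common(1)[0]
        if count * 2 <= len(remaining):
            break  # divergence: no portion holds a majority at this level
        prefix.append(portion)
        remaining = [c for c in remaining if len(c) > depth and c[depth] == portion]
        depth += 1
    return prefix
```

The returned portions would supply the topic and optional sub-topic part of a label, with the noun phrase chosen separately.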
The cluster module 128 generates the noun phrase portion by extracting all
the Headnote texts inside a cluster. In one embodiment, only those Headnotes
in the major Key Number™ topics are selected by the cluster module 128 for
this process. Several features are derived for each of the noun phrases, and
the top scored noun phrase (NP) is selected by the cluster module 128 as part
of the label. For example, in one embodiment, the several features include
the length of the NP, the term frequency of the composite NP, the term
frequency of the NP's terms considered jointly, and the TF-IDF score using
normalized TF. As used above, TF stands for term frequency within the given
document, DF stands for document frequency within the given collection, and
IDF stands for inverse document frequency or the reciprocal of the document
frequency. In the given embodiment, weights are determined for this set of
features so as to optimize the performance of the label selection process
based on empirical evidence from a label grading process. It is also worth
noting that for NP scoring and selection purposes an NLP-simplified version
of the extracted NPs is used (stopped, stemmed, etc.). By contrast, for
presentation purposes, a canonical (original) form of the NP is used for user
readability.
Because of the importance of each cluster possessing a label that is unique
across the set
of clusters, two types of uniqueness (or duplication) checks are performed. In
order to apply
these checking processes to the entire cluster set, the clusters are first
ranked by a fitness
function that relies on many factors including but not limited to the number
of initial cases in the
given cluster, and additional features such as the popularity of the cases in
the cluster (based on
citations and based on user selection), the number of jurisdictions
represented, the average age of
the cases in the cluster, and the average age of the Key Numbers™ in the
cluster. Such a fitness
function effectively enables one to rank the clusters by a quality metric.
Once the clusters are ranked according to the fitness function, the labeling
process is applied to the highest quality cluster first, then the next
highest, and so on. At the same time, the
resulting labels are recorded and if a given label has already been assigned
to a previously
processed cluster, the candidate label is rejected in favor of the next
candidate label that has not
been previously assigned. Similarly, a semantic representation of each label
is recorded, and
each candidate label is also assessed for its semantic uniqueness. If a highly
semantically
similar label has already been assigned, a label can be rejected for a less
semantically similar
label. Processing for this semantic comparison process includes basic
natural language
processing such as stopping, stemming, term deduping, etc. A threshold may
also be invoked
such that if the core constituent tokens in two labels being compared are 80%
similar, they are
considered semantically similar, and the candidate will be rejected in favor
of the next candidate
that is not found to be semantically similar using this threshold.
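The two uniqueness checks can be sketched as follows. The stop-word list, the crude plural stripping used in place of a real stemmer, the definition of similarity as shared tokens divided by the size of the larger token set, and the function names are all assumptions made for illustration; only the 80% threshold comes from the description.

```python
import re

STOP_WORDS = frozenset({"a", "an", "and", "for", "in", "of", "the", "to"})


def core_tokens(label):
    """Lower-case, tokenize, crudely stem, de-duplicate, and drop stop words."""
    tokens = re.findall(r"[a-z]+", label.lower())
    # Crude stand-in for stemming: strip a trailing "s" from longer tokens.
    stems = {t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens}
    return stems - STOP_WORDS


def token_similarity(label_a, label_b):
    """Shared core tokens divided by the size of the larger token set."""
    a, b = core_tokens(label_a), core_tokens(label_b)
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def assign_label(candidates, assigned, threshold=0.8):
    """Return the first candidate that is neither identical nor too similar
    to any previously assigned label, or None if all are rejected."""
    for cand in candidates:
        if all(cand != prev and token_similarity(cand, prev) < threshold
               for prev in assigned):
            return cand
    return None
```

Clusters would be processed in fitness order, with each accepted label appended to `assigned` before the next cluster is labeled.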
Once all of the clusters are identified and defined in the legal space by the
cluster module
128, various legal documents are associated with the predefined legal
clusters. For example,
when a legal document is presented for display in an online legal research
system, such as
Westlaw®, all the legal topics discussed in the document are automatically
identified and associated with related clusters, which can relay all related
cases, statutes, regulations, and other documents that discuss the same legal
issues as in the original document. To relay all the related documents, in
one embodiment, the cluster module 128 applies the search engine process as
described in Publication No. U.S. 2008/0033929 A1 using the generated query
of a cluster. The query of a cluster comprises a number of noun phrases and
key numbers. The selection of the noun phrases and key numbers is based on
their importance to the defined legal topics, using features similar to those
in the labeling process. By adding key numbers into the query of a cluster,
the cluster module 128 can tailor the search engine to retrieve the most
relevant cases, statutes, regulations, and other documents either online (in
real-time) or offline (pre-population).
An example workflow of document cluster association is shown in connection
with
Figure 4D. As shown in Figure 4D, in one embodiment, for an incoming document,
a list of
legal topics described in the document is determined by the cluster module
128. For each topic,
a list of similar clusters is associated and recommended.
Depending on the metadata available, four different techniques are implemented
by the
cluster module 128, as illustrated in Figure 4E. For documents with Headnotes
defined (cases,
some administrative decisions and briefs), the cluster module 128 process
operates similarly to
the process described in connection with finding legal issues via Headnotes
grouping discussed
above.
For some statutes and regulations that include attached notes of decisions
(NODs), a compilation of cases that construe or apply the statutes or
regulations, the NODs are detailed down to the Headnote level for each of the
cases. As such, the cluster module 128 identifies Key Number™ information
from them. The cluster module 128 then groups these Key Numbers™ based on
their catchlines such that Key Numbers™ with the most common sub-topics are
grouped into one group. An example of Key Numbers™ grouping is shown in
connection with Figure 4F.
As shown in the Figure 4F example, five (5) Key Numbers™ are shown and, after
grouping by the cluster module 128, three Key Numbers™ 197K201, 197K202, and
197K203 are grouped into one group since they have common sub-topics up to
197I(A)1 ("Nature of Remedy in General"), and two Key Numbers™ 197K912 and
197K913 are grouped in another group for the sub-topic 197V ("Suspension of
writ"). These grouped Key Numbers™ define the topics of the document.
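The grouping of Key Numbers by shared catchline sub-topics can be sketched in a simplified form that groups Key Numbers whose catchlines agree on every portion except the leaf. The representation of a catchline as a sequence of portions, the exact-parent grouping rule, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_parent(catchlines):
    """Group Key Numbers whose catchlines match on every portion except the leaf.

    catchlines: dict mapping a Key Number to its catchline portions,
    ordered from topic down to leaf.
    """
    groups = defaultdict(list)
    for key_number, path in catchlines.items():
        groups[tuple(path[:-1])].append(key_number)
    return {parent: sorted(members) for parent, members in groups.items()}
```

Each resulting group corresponds to one candidate legal topic of the document.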
For documents with citing documents and no Headnotes or NODs, the cluster
module 128 incorporates two pieces of information into its method: one is
from all the Key Numbers™ of the cited cases and another is from Key
Numbers™ suggested by CaRE-KNA™ using the document text. KNA stands for Key
Number™ Assignments. The cluster module 128 groups these two sets of Key
Numbers™ by their topics and then sorts them based on topic popularity. The
Key Numbers™ from the cases side with the highest topic popularity that agree
with the Key Numbers™ from the CaRE-KNA™ side describing the topic level are
selected by the cluster module 128 to generate legal topics for the document.
Groupings, similar to those shown in Figure 4F, are made by the cluster
module 128. One grouping is from all the Key Numbers™ of the cited cases and
another is from Key Numbers™ suggested by the CaRE-KNA™ service using the
summarized document text. In one embodiment, the summarized document text
comprises the first 2,000 characters of the document. Those skilled in the
art will realize there are other methods for generating summaries of legal
documents. Examples of such methods may be found in Schilder, F. and
Kondadadi, R., "FastSum: Fast and accurate query-based multi-document
summarization," as contained in the proceedings of the Joint Annual Meeting
of the Association for Computational Linguistics and the Human Language
Technology Conference (ACL-HLT 2008), pages 205-208, Columbus, Ohio, June
2008. CaRE-KNA™ is a Key Number Assignment™ service built upon the CaRE®
indexing system using the collection of the Key Numbers™ with their
corresponding Headnote texts. It can recommend the most relevant Key Numbers™
based on an input query text. The cluster module 128 groups these two sets of
Key Numbers™.
For documents with no meta-data but text, the cluster module 128 applies a
CaRE-KNA™ service to suggest Key Numbers™ based on the text. The Key Numbers™
with the highest topic popularity are then used by the cluster module 128 to
perform tasks similar to those shown in Figure 4F to generate legal topics
for the document.
After each legal topic has been identified by the cluster module 128 for a
document, the
cluster module 128 associates each document with the pre-defined legal
clusters based on its
similarity.
For example, in one embodiment, the association candidate selection process
executed by
the cluster module 128 is similar to the candidate selection process for
merging clusters
described previously. In particular, for all the clusters which can be
associated to the topics in legal documents, the cluster module 128 generates
the three CaRE® indices based on the word-pair features, Key Number™ profiles
features, and KeyCite® citing/cited profiles features.
For each topic, the three sets of features described previously, namely the
word-pair features based on the Headnote text, the Key Number™ profiles
features, and the KeyCite® citing/cited profiles of the seeding document, are
calculated by the cluster module 128 and sent to the CaRE® indices. Each
CaRE® engine is used to retrieve its independent suggestions, which are
aggregated later to form a list of candidates to be associated. Figure 4G
shows an example flowchart of the association candidate selection process.
Next, the cluster module 128 computes a list of features, as shown and
described in connection with Figure 4H. Next, the cluster module 128 applies
an SVM ranker to these computed features. The cluster module 128 then selects
the top scored candidates as the associated clusters to the topics. Figure 4I
shows a flowchart illustrating this process.
In one embodiment, the cluster module 128 associates sets of documents stored
in folders
with a set of document recommendations that address the same legal issue(s)
which are relevant
to the original document set. For example, in the Westlaw® system, a
"Research Folder" is a place where a user can store documents of interest
together. The research folder can contain various numbers of documents and
various document types. This folder-based document recommendation method
executed by the cluster module 128 identifies common topics (legal issues)
among these foldered documents and proceeds to return additional relevant
documents that discuss the same topics.
For example, in one embodiment, input to the method is a list of documents,
such as
cases, statutes, and regulations found in a folder box 480 of Figure 4L. The
output of the method
is a list of additional documents addressing the same distinct legal issues.
The method involves
two steps. The cluster module 128 first detects topics and then retrieves the
additional
documents which share the same legal issues as shown in functional box 481 of
Fig. 4L.
In the topic similarity detection step, which takes place in functional box
482, the cluster
module 128 uses the relationships among the documents in a folder to find
additional relevant
documents. Advantageously, instead of utilizing the document content itself,
which may be
computationally expensive, these relationships identified in the document
metadata are exploited.
These document relationships are quantified by a similarity matrix based on
two sources
of information. One is the cluster memberships of the documents in the
folders. The second is
the citation information associated with the documents, citing as well as
cited citation
information.
The dimension of the similarity matrix is n x n, where n is, for example, the
number of
legal cases. Such a matrix could also include other document types such as
legal briefs, for
example. Each entry of the matrix, a_ij, is the similarity score of the
document in row i and the document in column j. In the typical embodiment,
the matrix is sparse (that
is, the majority of
the entries have 0 values). This property allows for an efficient storage of
the entries in a
database.
The matrix is computed offline and the results (entries) are stored in a
database (which is part of 482 of Figure 4L). On the online side, an
item-based top-N ranking algorithm is used in
algorithm is used in
functional blocks 483 and 484 to "recommend" the top N documents in response
to the
documents stored in folders 480.
In practice, one matrix is generated based on document cluster memberships and
another
matrix is generated based on document citation information. Since both cluster
membership and
citation information can be used, there can be two scores that exist between
two documents.
Once these matrices have been generated offline, on the online side, the
inputs of the
recommendation algorithm are the document identifiers of the source documents
(from the given
folder) along with other useful metadata, for example, the jurisdictions of
the documents. In one embodiment, where two sources of information are used,
two recommendation algorithms are in fact run, one based on membership and
the other based on citation information; the results from both are ranked and
then combined. Below is a set of pseudo-code for the recommendation algorithm
based upon cluster memberships.
Recommend docs based on memberships
Input: f_1, f_2, ..., f_n, jurisdiction (f_i is a case in the target folder), n > 2
Steps:
1. Get d_k (the kth document) for each s_i (the similarity score) where
   sim1(s_i, s_j) >= 0.05 in the same jurisdiction
   If sim1(d_k, s_i) > 0.05, count(d_k)++
2. Compute w(s_i), which is the pagerank of s_i based on the graph with
   source cases only
3. Compute score(d_k), if count(d_k) is not less than T (an empirical
   threshold of 500)
   T = max(M, M/2+2), M is the number of s_i whose w(s_i) > 0
   score(d_k) = count(d_k) x SUM over s_i of (sim1(d_k, s_i) x w(s_i))
4. Rank d_k based on the score from step 3
Output: a list of top k cases (k <= 10) with its ranking
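The pseudo-code above can be sketched in Python under several stated assumptions: the similarity scores are supplied as a dictionary keyed by (candidate document, source case), the pagerank weights w are supplied as an input rather than computed, the jurisdiction filter is omitted, and the count threshold T is passed in as the hypothetical parameter `min_count`. The function and parameter names are illustrative only.

```python
def recommend(sources, sim, weights, min_count=1, min_sim=0.05, top_k=10):
    """Rank candidate documents against the source cases of a folder.

    sources: set of source case ids taken from the folder.
    sim: dict mapping (candidate_doc, source_case) to a similarity score.
    weights: dict mapping a source case to its weight w (assumed precomputed).
    Returns up to top_k (doc, score) pairs, best first.
    """
    counts, sums = {}, {}
    for (doc, src), value in sim.items():
        # Skip pairs outside the folder, folder members themselves, and weak links.
        if src not in sources or doc in sources or value < min_sim:
            continue
        counts[doc] = counts.get(doc, 0) + 1
        sums[doc] = sums.get(doc, 0.0) + value * weights.get(src, 0.0)
    # score = count x sum of (similarity x weight), as in step 3.
    scores = {d: counts[d] * sums[d] for d in counts if counts[d] >= min_count}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

A candidate that is similar to several folder cases is rewarded twice, once through the count factor and once through the weighted sum.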
In another embodiment, one could make recommendations in response to
documents in a folder that are in fact clusters rather than documents. Such
an embodiment may aggregate clusters among those associated to the documents
in the folders, and may assign each cluster a combined score (defined as
S_COMB). The cluster module 128 then sorts these scores in descending order.
The cluster module 128 computes the combined score S_COMB based on topic
scores S_TP (rank can be implied as well, defined as R_TP), cluster scores
S_AC (rank can be implied as well, defined as R_AC), and the frequency count
(defined as f).
For example, referring to Fig. 4J, one folder includes three documents (Doc 1
450, Doc 2 460, and Doc 3 470). The first document (Doc 1 450) includes two
topics (Topic 1 451 and Topic 2 452), and the second and third documents
include one topic (Topic 1 461). Further, each topic is associated with two
clusters (Clu N). The combined score S_COMB computed for Clu 1 is a
combination of frequency count (f=2 since Clu 1 is associated with two
topics), scores from topics S_TP (two scores, one from Doc 1->Topic 1, and
another from Doc 2->Topic 1), ranks from topics R_TP (rank is inferred from
the scores and normalized based on a power-based function, so that clusters
from lower ranked topics use a lower weight), scores from association S_AC,
and ranks from association R_AC (similar normalization is applied). The
cluster module 128 computes the combined scores S_COMB for other clusters in
a similar fashion. In one embodiment, the cluster module implements the
following formula to compute the combined score S_COMB for each of the
clusters,
$$S_{COMB} = \sum_{i} \left( S_{TP_i} \times B^{(R_{TP_i} - 1)} \times S_{AC_i} \times B^{(R_{AC_i} - 1)} \right)$$
where B is a constant, i is the ith cluster, and R_x = 1, 2, .... Preferably,
B is equal to 0.9. An example of output generated by the cluster module 128
after topic detection is shown in Fig. 4J.
In the topic consolidation step, the cluster module 128 condenses the
aggregated clusters into groups such that each group contains highly
"similar" clusters, and a representative cluster is selected for each group.
For example, in one embodiment, the cluster module 128 scans through the
ordered
clusters list and performs a pair-wise similarity comparison between clusters
using the
information extracted from their queries, namely the NPs and the Key
Numbers™. For clusters with similarity scores above a certain threshold, the
cluster module 128 merges those clusters
into a single group. In one implementation using a range of similarity scores
from 1 through 5
(with 5 being the most similar), the threshold is 2.7. The cluster module 128
then selects the
cluster ranked highest in the ordered list (from the topic detection step
described previously with
reference, in part, to Figure 4J) to be the representative of the group. The
cluster module 128
computes the score of the selected group as the sum of the scores of the
clusters in the group.
The remainder of the clusters in the group are not visible as output of the
algorithm. After the
comparison is complete, the cluster module 128 sorts the cluster groups by
group score in
descending order.
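The consolidation step can be sketched as a single pass over the ordered cluster list. Comparing each cluster only against the representative of each existing group is an assumption made for illustration, as are the dictionary layout and the function name; the default threshold of 2.7 on a 1-to-5 scale comes from the description.

```python
def consolidate(ordered_clusters, similarity, threshold=2.7):
    """Group similar clusters and keep the highest-ranked one as representative.

    ordered_clusters: list of (cluster_id, score) pairs, best first.
    similarity: callable (id_a, id_b) -> score on the 1-to-5 scale.
    Returns groups sorted by summed score, highest first.
    """
    groups = []
    for cluster_id, score in ordered_clusters:
        for group in groups:
            if similarity(group["representative"], cluster_id) > threshold:
                group["members"].append(cluster_id)
                group["score"] += score  # group score is the sum of member scores
                break
        else:
            # No similar group found: this cluster starts a new group.
            groups.append({"representative": cluster_id,
                           "members": [cluster_id],
                           "score": score})
    groups.sort(key=lambda g: -g["score"])
    return groups
```

Only the representative of each group would be surfaced in the output; the remaining members contribute to the group score but are not shown.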
Example output of the topic consolidation step is shown on the right side of
Figure 4K.
As shown in Figure 4K, in the output, the cluster module 128 grouped clusters
Clu 1, Clu 2 and Clu 5 together, as these clusters were determined to be
similar. Also, clusters Clu 3 and Clu 4 were determined to be similar and
thus grouped together. The cluster module 128 then uses the Clu 1 cluster as
being representative of the first group, and cluster Clu 3 as being
representative of the second group. Clusters Clu 2, Clu 5 and Clu 4 are not
made available in the output.
The cluster module 128 also provides a query-to-cluster association. The
method of query-to-cluster association used by the cluster module 128 is
similar to the
process described in
connection with documents having no meta-data. In this case, the query is
considered the text.
Reference may be made to Figure 4F and associated description for a more
detailed explanation.
While the various sections of the detailed description above are intended to
illustrate and
teach ways of practicing the current invention, those skilled in the art will
appreciate that the
invention is not limited to the detailed description. For example, the
invention may be used in
other information solutions environments relating to, e.g., financial
information, health
information, tax and accounting information, scientific information and/or
combinations of the
same. Thus, the scope of the invention is defined by the claims below and
their equivalents.