Note: Descriptions are shown in the official language in which they were submitted.
CA 022~1043 1998-10-07
WO 97/38378 PCT/US97/05782
METHOD OF ORGANIZING INFORMATION RETRIEVED FROM THE INTERNET
USING KNOWLEDGE BASED REPRESENTATION
Technical Field
This invention relates to the field of accessing information on the Internet and, more
particularly, to a method of organizing information retrieved from the Internet using a knowledge
based representation system.
Background of the Invention
The Internet is a series of inter-connected networks which facilitate the exchange of
information, data, and files. Users connected to the Internet have access to the vast amount of
information on these networks. A typical way of getting access to the Internet is through an
online service server. Referring to FIG. 1, networks 110, 112, and 114 are connected to Internet
100 via online service servers 120, 122, and 124, respectively. Another way of getting access to
the Internet is through a dial-in Internet provider. For example, a user on his personal computer
("P.C.") 158 may access Internet 100 by dialing in to Internet provider 150 using his modem
152. Routers, which connect computers and networks, direct traffic in a network and on the
Internet. Routers 160, 162, 164, and 166 examine packets of data that travel across the networks
and Internet to determine where the data is headed.
Online service servers and Internet providers allow users to search the World Wide Web
("Web"), a globally connected network on the Internet, using software programs known as search
engines 130, 132, 134, and 154. Search engines are also known as search tools and Web
crawlers. These search engines travel across the Web gathering documents by following the
hypertext links found in Web (home) pages 140, 142, 144, and 156.
One way of searching the Internet is by keywords. For example, a user types in a query
string of keywords that describes the information he is looking for. The search engine searches
databases on the Internet and results are returned in hypertext markup language ("HTML")
pages. A user can then view a document of interest by "clicking" on a link to that document.
Clicking refers to the process of actuating a mouse switch by centering a cursor on the desired
item.
While present search engines provide for searching of keywords on the Internet, the vast
amounts of information on the Internet make getting relevant information difficult. Stated
another way, keyword searches typically result in a return of vast amounts of information that the
user must browse through in order to retrieve the relevant information. Thus, what is required is
a more effective method of retrieving information from the Internet.
Summary of the Invention
The above-stated problem of organizing information search results is mitigated by the
application of knowledge based representation techniques for automatically categorizing search
results. This information retrieval and management system associates a knowledge base with
search servers to improve the relevance and precision of search tasks. The knowledge base
provides a user profile (topic taxonomy) that reflects the interests and preferences of the user for
organizing information. The system uses this knowledge base to organize the results of keyword
searches. The system automatically categorizes and segments search results in accordance with
the knowledge base to provide for easy searching of relevant information. The system displays
the search results over a subset of the knowledge based topic taxonomy, segmenting the results in
a way that makes it easy to find the most relevant documents, and filtering out irrelevant results.
Brief Description of the Drawings
In the drawing,
FIG. 1 illustrates a diagram of computers and networks and their connection to the
Internet for discussion of the environment in which the present invention operates;
FIG. 2 is a block diagram of an exemplary knowledge based browser displaying a
graphical representation of a concept generalization taxonomy in accordance with the principles
of the present invention;
FIG. 2a is an actual screen display of the exemplary knowledge based browser of FIG. 2;
FIG. 3 is a block diagram of a search interface in accordance with the principles of the
present invention; and
FIG. 4 shows a flow diagram illustrating the steps required for a user to retrieve
information from the Internet and organize it using knowledge based representation.
Detailed Description
Referring to FIG. 1, there is shown an environment for the present invention including
exemplary networks 110, 112, and 114 and P.C.'s 158 and 159 which are inter-connected to
Internet 100. These networks comprise users who are connected to one another in, for example,
a token ring network (network 114) or through an Ethernet network (networks 110 and 112).
Each network further comprises a server 120, 122, and 124. A server is a host computer that
allows users to communicate with each other on the network or with users outside the network
through the Internet. Users on P.C.'s 158 and 159 may subscribe to Internet Provider 150, which
allows users to communicate with each other and other users on the Internet.
Any user may search for information available on the Internet. If a computer or network
is connected to the Internet, then information on that computer or network is accessible by others
if it is not protected. Since the Internet is a global network, the amount of information that can
be retrieved is immense. Many servers and providers include search engines 130, 132, 134, and
154 that allow users to search by keywords. These search engines are search-application
based computer programs that run on online service servers 120, 122, 124, and
Internet provider 150. Searching by keywords typically results in a return of vast amounts of
information that the user must browse through in order to get the desired information.
Currently, there are two ways of searching the Internet. Both methods operate under a
client/server model. In the client/server model, a user runs a piece of software on his
computer, or a shared program on a server (the client), to use the resources of a distant server
computer (other servers connected on the Internet). For example, in FIG. 1, a user on P.C. 110a
may search for information on online service servers 122 and 124 and Internet provider 150.
Similarly, a user on P.C. 156 may search for information on online service servers 120, 122, and
124. The distant servers, e.g., online service servers 120, 122, and 124 and Internet provider 150,
are also called hosts because they serve many users of many networks. The hosts allow many
different clients to access their resources at the same time; the hosts are not devoted to a single
user.
The first way of searching the Internet is through indexes. Indexes present a highly
structured way of finding information. Indexes let users browse through information by
categories such as arts, computers, entertainment, sports, etc. In a Web browser, a user on his
P.C. 110a can click on a category by, typically, using his mouse 110b and is presented with a
series of subcategories. For example, under sports a user may find baseball, basketball, football,
etc. Depending on the size of the index, there may be several layers of subcategories. When the
user gets to the subcategory he is interested in, he will be presented with a list of relevant
documents. To get to those documents, the user clicks on links to them. "Yahoo!" is the name
of a popular index on the Internet. Yahoo! and other indexes also allow users to search through
them by typing in words that describe information that the user is looking for. The user then gets
a set of search results--links to documents that match his search. To get the information, the user
clicks on a link to the document.
The second way of finding information is to use search engines, also known as search
tools. Search engines operate on essentially static pre-built indexes, i.e., the indexes are built up
from online content and stored in a database on a search server. Web crawlers are used by the
search engines for gathering the online content that is retrieved and indexed in the search server's
database. Some popular Internet search engines include Lycos, WebCrawler, and Alta Vista. To
begin a search, a user types in keywords that describe the information he wants. Results that
match the user's search criteria from the search are sent back to the user. From the list of results,
the user can retrieve a document by clicking on a link to that document.
Although both indexes and search engines allow users to find information on the Internet,
the volume of information found is typically large, and it is often difficult to locate the relevant information.
Therefore, it is desirable to automatically categorize search results found on the Internet so as to
allow users to easily browse through the search results to find relevant information.
According to the present invention, knowledge based representation systems, with their
capabilities for representing and inferring relationships among objects, mitigate the above
problems. In particular, the present invention is directed to a knowledge based information
retrieval and management system that enhances searches on any multi-network system such as
the Internet. The system provides users with means to superimpose a tailored conceptual
organization over the information found on the Internet, thereby enriching the usefulness of and
access to that information. Referring to FIG. 1, the system is integrated with existing Web
browsers 130, 132, 134, and 154 to create a seamless environment combining hypertext browsing
with conceptual navigation. The system may also be stored on a personal computer, e.g., P.C.
110a, in which case only users with access to that personal computer may use the system.
FIG. 2 illustrates an exemplary knowledge based browser which
displays a graphical representation of a concept generalization taxonomy 200 in accordance with
the present invention. A taxonomy is a generalization hierarchy which graphically displays
relationships between concepts. A concept is an abstract description of an object. Nodes in FIG.
2 correspond to knowledge base concepts (e.g., 210, 220, 230, 212, 214, etc.), and edges (e.g.,
210a, 210b, 220a, etc.) connecting the nodes indicate subsumption relationships between the
concepts. A feature of the present invention is that the system can manage the subsumption
relationships automatically based on concepts and instances (270, 280). An instance is a specific
realization of a concept, i.e., a concept is an abstract description of something while an instance
of that concept is a real object that satisfies that description. For example, when a new document
is added to the knowledge based browser as an instance, the system infers all the places it
belongs in the taxonomy.
As illustrated in FIG. 2, the most general concepts are at the left. Following outgoing
edges of a concept node (going from left to right) leads to more specialized concepts. For
example, the topic "artificial intelligence" 228 is a specialization of "computer science" 220, and
"knowledge representation" 229 is in turn a specialization of "artificial intelligence" 228. The
panels 270 and 280 within this display show lists of instances of these concepts. For example,
the panel 270 shows documents which are instances of the topic "pediatric medicine" 212; the
panel 280 shows instances of the concept "knowledge representation" 229. Instances are
inherited by parent concepts all the way up the hierarchy, so for example, the documents
appearing under "knowledge representation" would also appear under "computer science". The
method of organizing instances is discussed below with regard to a search interface. FIG. 2a is
an actual screen display of the exemplary knowledge based browser of FIG. 2, illustrating the
concept generalization taxonomy 200 and the subsumption relationships between concepts and
instances.
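The taxonomy described above can be sketched in a few lines of Python. This is only an illustrative model, not the data structure of the described system: the class name `Taxonomy`, its methods, and the example URL are assumptions introduced here. It shows concepts linked by subsumption edges and instances that are inherited by every more general concept.

```python
from collections import defaultdict


class Taxonomy:
    """Hypothetical concept generalization hierarchy with upward instance inheritance."""

    def __init__(self):
        self.parents = defaultdict(set)    # concept -> more general concepts
        self.instances = defaultdict(set)  # concept -> directly attached instances

    def add_concept(self, concept, parent=None):
        self.parents[concept]              # register the concept even without a parent
        if parent is not None:
            self.parents[parent]
            self.parents[concept].add(parent)

    def add_instance(self, concept, instance):
        self.instances[concept].add(instance)

    def ancestors(self, concept):
        """All concepts that subsume `concept`, directly or transitively."""
        seen, stack = set(), list(self.parents[concept])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return seen

    def instances_of(self, concept):
        """Direct instances plus those inherited from more specialized concepts."""
        result = set(self.instances[concept])
        for other in list(self.parents):
            if concept in self.ancestors(other):
                result |= self.instances[other]
        return result


tax = Taxonomy()
tax.add_concept("computer science")
tax.add_concept("artificial intelligence", parent="computer science")
tax.add_concept("knowledge representation", parent="artificial intelligence")
tax.add_instance("knowledge representation", "http://example.org/kr-paper")
```

With this model, a document attached to "knowledge representation" is also returned by `tax.instances_of("computer science")`, mirroring the inheritance described for panels 270 and 280.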
The search interface operates similarly to that of the knowledge base browser. The search
interface uses a knowledge base to refine search results by segmenting and categorizing results
with respect to a user's concept generalization taxonomy. For example, after results from a
keyword search have been combined in a result set for display, the system provides an additional
refining step that can further focus the result set. Refining the result set against the knowledge
base involves retrieving the documents in the result set and processing them with the knowledge
base pattern matchers. Textual patterns associated with concepts in the knowledge base allow
the knowledge representation system to categorize and organize these documents within the
concept taxonomy. Each pattern in the knowledge base is associated with a concept. Stated
another way, each document is compared against these pattern matchers to determine whether
there are any concepts that match the document. The output of this comparison process is a set
of specific concepts in the knowledge base that have some correspondence to the content of the
document. A record of a match between a concept and the document is made in the knowledge
base by creating a temporary instance whose description includes the matched concepts. Finally,
the refined search result is presented graphically over a subset of the knowledge base topic
taxonomy. This subset is defined by those concepts having one or more of the temporary
instances created during the matching process. This is illustrated in FIG. 3 where only those
concepts that match the contents of a document are displayed.
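The refining step can be pictured with a minimal sketch. The text does not specify the form of the pattern matchers, so regular expressions are assumed here, and the pattern table, function names, and example URLs are all hypothetical. Each document is compared against the patterns of every concept, a temporary instance is recorded for each match, and only concepts with at least one instance remain in the displayed subset.

```python
import re

# Hypothetical textual patterns, one list per knowledge-base concept.
CONCEPT_PATTERNS = {
    "knowledge representation": [r"\bknowledge\s+representation\b", r"\bontolog(y|ies)\b"],
    "pediatric medicine": [r"\bpediatric\b", r"\bchildren's\s+health\b"],
}

COMPILED = {
    concept: [re.compile(p, re.IGNORECASE) for p in patterns]
    for concept, patterns in CONCEPT_PATTERNS.items()
}


def match_concepts(text):
    """Return the concepts whose patterns match the document text."""
    return sorted(
        concept
        for concept, patterns in COMPILED.items()
        if any(p.search(text) for p in patterns)
    )


def refine(result_set):
    """Map each matched concept to the documents recorded as its temporary instances.

    `result_set` maps a document URL to its retrieved text. Concepts without
    any match are absent from the result, which yields the taxonomy subset.
    """
    subset = {}
    for url, text in result_set.items():
        for concept in match_concepts(text):
            subset.setdefault(concept, []).append(url)
    return subset


docs = {
    "http://example.org/a": "A survey of knowledge representation and ontologies",
    "http://example.org/b": "Pediatric dosage guidelines",
    "http://example.org/c": "Stock market report",
}
refined = refine(docs)
```

In this example the third document matches no concept and therefore does not appear in `refined`, which corresponds to filtering out irrelevant results.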
The present invention of using a knowledge based representation system in organizing
data is especially helpful when a keyword search results in thousands of documents. By running
pattern matchers against those documents, one can quickly narrow down those documents that
are most relevant to the user.
Accordingly, the knowledge based representation system (browser and search interface)
of the present invention allows users to quickly find relevant information.
Another feature of the taxonomy is that by grouping the results according to concepts, a
user may zoom in on the part that he thinks is most relevant. This further enhances searching on
the Internet by saving browsing time.
The search interface further implements transparent, concurrent access to multiple index
servers in order to maximize query coverage and minimize response latency. By explicitly
representing the capabilities of the individual search engines, the query system ensures that only
those index servers capable of handling the query are consulted.
Another feature of the present invention is a user interface which provides editors for
extending and reorganizing the concept hierarchy. The user interface also provides for a
navigation browser that maintains an interactive graphical map of the navigation history. The
navigation browser is a tree-structured graphical representation of the user's browsing history.
Its function is as follows: as the user browses, he generates an ordered sequence of the web sites
he visits, following links from one page to another. As he backtracks and makes new browsing
choices, the browsing history becomes a branching tree. The navigation browser keeps track of
these choices adding new nodes to the tree for every site/page visited. This tree, besides showing
an overview of the browsing history, becomes an alternative way to navigate (by clicking on the
node in the tree to return to the associated page).
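The branching browsing history can be sketched as a simple tree. The class and method names below are assumptions made for illustration, and the URLs are placeholders. Each visited page becomes a new child of the current node, backtracking moves to the parent, and selecting a node returns to its page.

```python
class HistoryNode:
    """One visited page in the hypothetical navigation tree."""

    def __init__(self, url, parent=None):
        self.url = url
        self.parent = parent
        self.children = []


class NavigationHistory:
    def __init__(self, start_url):
        self.root = HistoryNode(start_url)
        self.current = self.root

    def visit(self, url):
        """Follow a link: add a new node under the current page."""
        node = HistoryNode(url, parent=self.current)
        self.current.children.append(node)
        self.current = node
        return node

    def back(self):
        """Backtrack to the page from which the current page was reached."""
        if self.current.parent is not None:
            self.current = self.current.parent
        return self.current

    def jump_to(self, node):
        """Navigate by selecting a node in the tree; returns the page to load."""
        self.current = node
        return node.url


history = NavigationHistory("http://example.org/home")
page_a = history.visit("http://example.org/a")
history.visit("http://example.org/a/1")
history.back()                      # return to page a
history.visit("http://example.org/a/2")
```

After the backtrack and the second choice, the node for page a has two children, so the linear history has become a branching tree.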
Another feature of the present invention is that the system architecture separates the
knowledge base from the client to allow the user to maintain a consistent view of his information
space regardless of the client's location. By keeping the knowledge base in one place, the
environment can follow the user from one platform to another. An advantage of the separation is
to help ensure continuous availability of the system server since it provides shared access to the
knowledge base and performs autonomous monitoring tasks even when the client is inactive or
disconnected. In other words, the knowledge base may be stored on another server, separated
from the client.
The flowchart in FIG. 4 illustrates the steps required for a user
to retrieve information from the Internet and organize it using a knowledge based representation
system in accordance with the present invention.
In step 401, a user enters a query string of keywords to be searched on his personal
computer 110a using a knowledge based Web browser 130 in accordance with the present
invention. The knowledge based Web browser is software that may be installed in either a
client 110a or server 120.
In step 403, the query string is pre-processed to determine which search servers are
capable of understanding the query syntax. This is done by examining the Universal Resource
Locator ("URL") of the query string to determine which server(s) to send the request to.
Generally, the query has to be translated into the specific query syntax of the server from which the user is
requesting information. Typically, a query translator is provided with an interface to the server
for serving the query.
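The pre-processing of step 403 can be illustrated with a short sketch. The text does not give the capability representation or any server's query syntax, so everything below is assumed: the `Query` and `ServerProfile` types, the use of a Boolean operator as the capability being checked, and the two made-up engines with their syntaxes. The sketch selects the servers able to handle the query and translates it for each one.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple


@dataclass(frozen=True)
class Query:
    keywords: Tuple[str, ...]
    operator: str = "AND"          # how the keywords are combined


@dataclass(frozen=True)
class ServerProfile:
    """Explicit description of what a (hypothetical) search server can handle."""
    name: str
    operators: FrozenSet[str]
    translate: Callable[[Query], str]


PROFILES = [
    # engine-a is assumed to accept infix Boolean operators.
    ServerProfile("engine-a", frozenset({"AND", "OR"}),
                  lambda q: f" {q.operator} ".join(q.keywords)),
    # engine-b is assumed to accept only required terms marked with "+".
    ServerProfile("engine-b", frozenset({"AND"}),
                  lambda q: " ".join("+" + k for k in q.keywords)),
]


def capable_servers(query, profiles):
    """Keep only the servers whose declared capabilities cover the query."""
    return [p for p in profiles if query.operator in p.operators]


def plan(query, profiles=PROFILES):
    """Return the translated query string for each capable server."""
    return {p.name: p.translate(query) for p in capable_servers(query, profiles)}
```

For instance, `plan(Query(("knowledge", "base"), "OR"))` consults only engine-a, because engine-b does not declare support for OR.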
In step 405, queries are sent to each server that can handle the expression. Queries may
be sent out serially or concurrently. An advantage of sending out the queries concurrently is
reduction of latency in both the network and search process. In other words, all servers can work
on a query at the same time.
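Concurrent dispatch as in step 405 can be sketched with a thread pool. The function `fake_search` is a stand-in for a real network request and simply sleeps before returning placeholder URLs; the server names are hypothetical. The point of the sketch is that the total wait is bounded by the slowest server rather than by the sum of all servers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_search(server, query, delay=0.05):
    """Stand-in for a network round trip to one search server."""
    time.sleep(delay)
    return [f"http://{server}.example/{query}/{i}" for i in range(2)]


def query_all(servers, query):
    """Send the query to every server at once and collect results per server."""
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        futures = {s: pool.submit(fake_search, s, query) for s in servers}
        return {s: f.result() for s, f in futures.items()}


results = query_all(["engine-a", "engine-b", "engine-c"], "kb")
```

A serial loop over the same three servers would take roughly three times the per-server delay, whereas the pooled version overlaps the waits.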
In step 407, depending on the result size threshold, individual servers may need to be
queried repeatedly in order to gather the specified number of matches. Most servers, in order to
limit the amount of resources that are used for a given query, will break the results coming back
into some reasonable sets that are returned. For example, if there are a hundred hits for a search, a
server may be set up to return only ten hits at a time. As such, if the specified number of
matches is reached, then the procedure proceeds. If the specified number of matches has not
been reached, then the servers are repeatedly queried until it has been reached.
In step 409, the results that come back from the servers are merged into a single result set.
The results are merged by removing duplicates of the results. Each item in the result set consists
of a reference to a document (a URL) and possibly a single line of descriptive text.
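Steps 407 and 409 can be sketched as a paging loop followed by a duplicate-removing merge. The helper `make_server` is a hypothetical stand-in for a server that returns its hits one page at a time; the page size of ten follows the example in the text, and result items are modeled as (URL, description) pairs.

```python
def make_server(hits):
    """Hypothetical server that returns `hits` one page at a time."""
    def fetch_page(page, page_size):
        start = page * page_size
        return hits[start:start + page_size]
    return fetch_page


def gather(fetch_page, wanted, page_size=10):
    """Query one server repeatedly until `wanted` matches are collected."""
    collected, page = [], 0
    while len(collected) < wanted:
        batch = fetch_page(page, page_size)
        if not batch:              # the server has no more results
            break
        collected.extend(batch)
        page += 1
    return collected[:wanted]


def merge(result_lists):
    """Merge per-server results into one result set, dropping duplicate URLs."""
    seen, merged = set(), []
    for results in result_lists:
        for url, description in results:
            if url not in seen:
                seen.add(url)
                merged.append((url, description))
    return merged


server_a = make_server([(f"http://example.org/{i}", f"doc {i}") for i in range(25)])
server_b = make_server([(f"http://example.org/{i}", f"doc {i}") for i in range(20, 30)])
result_set = merge([gather(server_a, 25), gather(server_b, 10)])
```

Here the two servers share five documents, so the merged result set holds thirty unique references rather than thirty-five.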
In step 411, if the user desires further refinement of the result set, he can request that the
results be compared against the knowledge base pattern matchers. Otherwise, the result set is
displayed for the user.
In step 413, the document for each reference in the result set is retrieved.
In step 415, the pattern matcher(s) is applied to the document text to determine whether
there are any topic concepts that match the text.
In step 417, a list of topic concepts that match the text of the document is generated.
In step 419, an instance is created for each document that matches a concept.
In step 421, the instance for the document is classified in the knowledge base's topic
taxonomy.
The above iteration, steps 413-421, is parallelized to minimize the effects of network
latency in gathering the text, since the result set may contain dozens or hundreds of documents to
retrieve.
As the documents are retrieved and classified, the system incrementally displays the
post-processed results graphically over a subset of the topic taxonomy, where the subset is defined by
the collection of concepts having one or more instances from the search result. This is done to
categorize and segment the search result with respect to concepts that are familiar and
meaningful to the user. As such, by using the knowledge based representation system of the
present invention, the search result may be browsed at various levels of detail, depending on how
specific one wishes the segments to be.
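The parallel iteration over steps 413-421 and the incremental display described above can be sketched as follows. This is a simplified illustration under several assumptions: the in-memory `DOCS` table stands in for network retrieval, the pattern matchers are reduced to substring checks, and the `on_update` callback is a hypothetical hook where a display would be refreshed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in for the network: document URL -> document text (hypothetical data).
DOCS = {
    "http://a.example/1": "Inference in description logics",
    "http://a.example/2": "Baseball scores for the season",
    "http://a.example/3": "Gardening tips",
}

# Simplified pattern matchers: a concept matches if any phrase occurs in the text.
PATTERNS = {
    "knowledge representation": ("description logic", "inference"),
    "baseball": ("baseball",),
}


def retrieve(url):
    """Step 413: fetch the document for a reference in the result set."""
    return DOCS[url]


def classify(url):
    """Steps 415-417: apply the matchers and list the matching concepts."""
    text = retrieve(url).lower()
    concepts = [c for c, phrases in PATTERNS.items() if any(p in text for p in phrases)]
    return url, concepts


def refine_parallel(urls, on_update=None, workers=8):
    """Run steps 413-421 concurrently, updating the taxonomy subset as results arrive."""
    subset = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(classify, url) for url in urls]
        for future in as_completed(futures):
            url, concepts = future.result()
            for concept in concepts:           # steps 419-421: record an instance
                subset.setdefault(concept, set()).add(url)
            if on_update is not None:
                on_update(url, subset)         # hook for incremental display
    return subset


updates = []
final = refine_parallel(list(DOCS), on_update=lambda url, s: updates.append(url))
```

The callback fires once per completed document, so a display driven by it would grow as documents arrive instead of waiting for the whole result set.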
What has been described is merely illustrative of the application of the principles of the
present invention. Other arrangements and methods can be implemented by those skilled in the
art without departing from the spirit and scope of the present invention.