Language selection

Search

Patent 1276728 Summary

Third-party information liability

Some of the information on this Web page has been provided by external sources. The Government of Canada is not responsible for the accuracy, reliability or currency of the information supplied by external sources. Users wishing to rely upon this information should consult directly with the source of the information. Content provided by external sources is not subject to official languages, privacy and accessibility requirements.

Claims and Abstract availability

Any discrepancies in the text and image of the Claims and Abstract are due to differing posting times. Text of the Claims and Abstract are posted:

  • At the time the application is open to public inspection;
  • At the time of issue of the patent (grant).
(12) Patent: (11) CA 1276728
(21) Application Number: 553603
(54) English Title: INFORMATION RETRIEVAL SYSTEM AND METHOD
(54) French Title: SYSTEME ET METHODE D'EXTRACTION D'INFORMATIONS
Status: Deemed expired
Bibliographic Data
(52) Canadian Patent Classification (CPC):
  • 354/133
(51) International Patent Classification (IPC):
  • G06F 17/30 (2006.01)
(72) Inventors :
  • KLEINBERGER, PAUL J. (Israel)
(73) Owners :
  • KLEINBERGER, PAUL J. (Not Available)
  • TNET, INC. (Israel)
(71) Applicants :
(74) Agent: BORDEN LADNER GERVAIS LLP
(74) Associate agent:
(45) Issued: 1990-11-20
(22) Filed Date: 1987-12-04
Availability of licence: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): No

(30) Application Priority Data:
Application No. Country/Territory Date
938,163 United States of America 1986-12-04

Abstracts

English Abstract


ABSTRACT OF THE DISCLOSURE

A computerized information retrieval system provides a
break-down of the major and minor subject areas covered by a group
of texts selected from a text-database. Each text has associated
with it a set of descriptive keywords, which may or may not be
identical to a list of the words in the text. A user enters a
descriptive word, or words in some boolean combination, and the
system selects from among all the texts those whose keywords which
fulfill the logical test constituted by the user's search request.
A group of texts answering to the description having been found,
the system organizes the texts into sub-groups defined by the
keywords which the texts of each sub-group have in common, over
and above those keywords which are held in common by virtue of all
the texts conforming to the original search request. An
additional method of analysis provides a measure of the degree of
similarity between words or collections of words such as
sentences. This provides a facility for searching for text whose
keywords are similar to the words in a user's search request.


Claims

Note: Claims are shown in the official language in which they were submitted.



THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE
PROPERTY OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:

1. A method utilizing processor means and
associated memory means for making explicit the inherent
relationships among a first group of texts stored in the
memory means and being associated with at least one
keyword, the method comprising the steps of:
(a) defining as a "criterion key" that
keyword which among all the keywords associated with any
of the texts in the first group of texts, is associated
with the largest number of texts within the first group
of texts;
(b) separating the first group of texts into
first and second sub-groups of texts which are sub-
groups of the first group of texts, the first sub-group
of texts having as one of its keywords the criterion
key and the second sub-group of texts not including the
criterion key as a keyword; and
(c) displaying results obtained from the
above-described steps on a suitable display medium.

2. A method in accordance with claim 1, wherein
the above process is applied recursively to at least
one of the first and second sub-groups.

3. A method in accordance with claim 2, wherein
the results of separating the first group of texts into
sub-groups is displayed according to the following
method:
(a) if there exist keywords common to all
of the texts in the first group, a line of display
representing the first group is generated by listing on
the display medium some or all of those keywords
associated with all of the texts in that group;

41


(b) if a first sub-group is present, another
line of display is generated representing the first
sub-group, indented further to the right, listing some or
all of the keywords common to that first sub-group;
(c) additional lines of display may be
generated by treating the first sub-group as the "first
group" and applying the method of claim 1 and the pre-
sent claim to that group recursively; and
(d) if a second sub-group is present, then
another line of display is generated representing the
second subgroup, below the line generated in step (a)
and below any lines generated by step (b) and (c), and
with the same indentation as the line generated by step
(a), by treating the second sub-group as the "first
group" and applying the method of claim 1 and the pre-
sent claim, whereby a listing is created in traditional
outline format, such that whenever a line of display
represents a group which is further analyzed, the con-
sequent subdivision into further sub-groups is detailed
in a line or lines appearing below it and to the right,
and inversely when a group of texts is included within
another group, the line representing the included group
appears below and to the right of the line representing
the group of which it is a part.


4. A method in accordance with claim 1, wherein
the above method is applied recursively to each of the
sub-groups until the sub-groups contain only one text.


5. A method in accordance with claim 1, wherein
the process of choosing the criterion key is applied
repeatedly to the first group of texts, excluding as
criterion keys those keywords which have already been
used as criterion keys in a previous iteration.

42


6. A method in accordance with claim 1, wherein
the criterion keys are chosen in alphabetical order.

7. A method in accordance with claim 1, wherein
the criterion key is specified by the user.

8. A method in accordance with claim 2, wherein
the criterion keys are selected from a user-supplied
outline of keywords.

9. A method in accordance with claim 1, wherein
substantially all the words of each text are considered
to be keywords.

10. A method in accordance with claim 1, wherein
all the words of the texts, with the exception of a pre-
determined list of words, are considered to be keywords.


11. A method in accordance with claim 1, including
the step of scanning the keywords associated with all
the texts in the first group of texts for the presence
of a designated keyword and eliminating that keyword
from the keyword list of each text.

12. A method in accordance with claim 1, including
the step of scanning the keywords associated with the
first group of texts for the presence of a designated
keyword and replacing that keyword with a designated
substitute keyword.



13. A method in accordance with claim 1, including
the step of automatically allocating keywords by
scanning the texts of the first group of texts and com-
paring the scanned text on a word by word basis with a
set of "scan" words and allocating the scan word, or a
designated substitute for the scan word, to be a keyword
for that text, for all of the texts for which a match is
43


found.

14. A method in accordance with claim 12, wherein
all the words allocated as keywords for any of the texts
are used as scan words.


15. A method utilizing a processor means and asso-
ciated memory means for finding words in a collection of
target words stored in the memory means which are simi-
lar to a source word, comprising the steps of:
(a) separating the source word into its com-
ponent character strings, where a character string is
any contiguous subset of a word's characters in their
original order;
(b) searching the target words for the same
character string;
(c) allocating points to each target word for
each match between a character string representing part
of the source word and a character string representing
part of the target word; and
(d) ranking words of the target word list in
terms of similarity to the source word, according to
the total number of matches that are found between the
character strings of the source word and the character
strings of the target words.

16. A method in accordance with claim 15, in which
the points allocated to each of the target words are
weighted by a factor which depends on the length of the
character string being compared.

17. A method in accordance with claim 15, in which
the points allocated to each of the target words are
weighted by a factor which depends on the size of at
least one of the source words and the target words.
44


18. A method in accordance with claim 15, wherein
the points allocated are weighted by a factor which
depends on the location of the character string grouping
within at least one of the source and target words.


19. A method in accordance with claim 15, wherein
the points allocated are modified by a factor which
depends on whether the particular character string being
compared exists in a supplied list of character strings,
such as common suffixes and/or prefixes, or any other
supplied list of character strings.


20. A method in accordance with claim 15, wherein
the source word is first converted to an approximate
phonetic form prior to searching the target word list.


21. A method in accordance with claim 16, wherein
the source word is first separated into its component
syllables prior to searching the target words list, and
points are allocated based on matches between the
resulting groups of syllables.


22. A method utilizing programmed processor means
and associated memory means for finding a collection of
words, such as a phrase or a sentence, in a target word
list stored in the memory means similar to some source
word collection, comprising the steps of:
(a) separating the source word collection
into its component sub-groups of words, wherein a sub-
group is any set of one or more words which were con-
tiguous in the source word collection in their original
order;
(b) searching the target word list for iden-
tical sub-groups;




(c) allocating points to each word collection
in the target word list for each match between a sub-
group of words representing part of the source word
collection and a sub-group of words representing part of
that target word collection; and
(d) ranking word collections of the target
word list in terms of their similarity to the source
word collections, according to the total number of
matches that are found between the sub-groups taken from
the source word collection and the sub-groups belonging
to word collections of the target word list.

23. A system utilizing processor means and memory
means for making explicitly inherent relationships among
a first group of texts stored in the memory means and
being associated with at least one keyword, the system
comprising:
(a) first program means located in the memory
means and executable by the processor means for defining
as a "criterion key" that keyword which among all the
keywords associated with any of the text in the first
group of texts, is associated with the largest number of
texts within the first group of texts;

(b) second program means located in the
memory means and executable by the processor means for
separating the first group of texts into first and
second sub-groups of texts which are sub-groups of the
first group of texts, the first sub-group of texts
having as one of its keywords the criterion key and the
second sub-group of texts not including the criterion
key as a keyword; and
(c) third program means located in the memory
means and executable by the processor means for
displaying on a suitable display medium a visual indication
of which texts comprise the first and second sub-groups of
texts.
46

Description

Note: Descriptions are shown in the official language in which they were submitted.


~76~;~8

INFORMATION RRTRIEVAL SYSTEM AND METHOD

BACKGROUND OF THE I~VENTION
The present invention relates to an information
retrieval system and method which analyzes and summarizes
the information contained in a group of texts and
identifies similar words and word collections.
"Information retrieval" is the process of
selecting and presenting specific items from within a
large and heterogeneous collection of texts, according
to users' descriptions of the subjects in which they are
interested.
Some information retrieval systems index all
the words appearing in all the texts, others index
"keywords" which are descriptors assigned to each text
by the text's author or by someone else. In both cases
the user who wants to find a text does so by asking for
a search on a particular word, or on logical (Boolean)
combinations of words, or on words with some maximum
distance (or similar relationship) between them in the
texts, etc. In addition to requesting a specific word
or words, most systems allow the user to search for a
character string; e.g., LEXIS* and DIALOG*.
A typical search request, on traditional
systems, generates a long list or a large collection of
texts all of which logically satisfy the search cri-
terion, but only a small percentage of which will
actually be of use. The user is forced to expend much
time and much energy winnowing (searching) through the
texts found by the system, to pick out those truly rele-
vant to his needs.
This problem originates from the fact that theuser typically does not have EXACT knowledge in advance
of how the subjects of interest to him will have been
described.
*Trade Mark

-- 1 --

1276728

*If his description is very specific, he
will lose information: anything relevant to
his needs but described in a slightly different
manner will not be found by the system.
*If his description is very general, many
irrelevant texts will be found also, and the
winnowing process will be costly, time con-
suming, and tiring.

For this reason:
*On the level of office systems and personal
computers, despite the proliferation of com-
puters and the wide use of STRUCTURED data-
bases, the use of personal and interpersonal
catch-all text-based information retrieval
systems is almost unknown -- the bother and
the overhead involved in using traditional
systems are too great to make the effort worth
the trouble.
*On the level of massive public data-bases,
most bibliographic information systems have
attempted to solve the problem by limiting
users to a predetermined vocabulary of accep-
table keywords. (Users have a reasonable
chance of quessing what their subject will
have been called, since both users and authors
are confined to that published list of keyword
possibilities.) This solution has been
workable but at a price:
(1) To be an effective user of such
databases one must study and develop
3~ expertise in the use of the system.
They are, thus, inaccessible to
untrained users, and inappropriate
for casual use.
- 2 -


.~

1276728

(2) Because of their rigid structure,
such systems are of limited use (and
indeed are little used) in dynamic
environments such as would be found,
for example, in the case of an
unstructured corporation-wide catch-
all collection of information.

Two general types of information retrieval systems
and methods currently in use are as follows:
(1) In a first method, once the information
retrieval system (whatever their selection methodology)
have isolated or identified the group of texts which
satisfy the user's search criterion, the systems present
the user with a count of the number of texts within that
group, and the opportunity of sequentially reviewing the
texts which are mem~ers of that group.
The user either looks through the texts them-
selves, one by one, or looks through se~uential listings
of some part of the information available about each
text: that is, the user may choose to review a sequen-
tial list of the titles of the texts, or abstracts of
the texts, or lists of keywords of the texts, or the
initial paragraphs of the texts, or the dates and ori
gins of the texts, or some combinations of the above.
The user is then given some method of specifying
(usually by number) those texts for which he wishes
fuller information printouts, etc.
An example of this type of information retrieval
system is the DIALOG information retrieval system.
DIALOG* provides a user with the number of records (texts)
satisfying the search request. The user can then request
that any or all of the records be displayed and/or printed
in any one of a number of formats containing varying
and differing amounts of information.
*Trade Mark


1~7672~3

(~) A second method is generally used when
the number of texts presented by an initial search is
too large, or the original search criterion was too
general, to make it practical for the user to look
through sequential listings to pick out the texts he
wants. This method is essentially an extension of the
original boolean search facility: the user can ask for
additional searches to be made, and then can manipulate
the additional lists of texts thus generated by
requesting further lists to be created based on Boolean
combinations of the preceding lists (e.g., the new list
to include all the texts on list "A" and also on list
"B" but to exclude any which appear on list "C", etc.).
An example of this type of information retrieval
system is DIALOG*, where the user can make additional
search requests, and create new lists of texts based on
Boolean combinations of preceding lists.
Another example of this type of information
retrieval system is the LEXIS* system wherein a user can
modify his/her search request in an effort to narrow
down the number of cases (texts) developed from the
initial search request.
These methods, Boolean combinations of lists
and sequential screening or printouts of the texts
themselves or of some subset of the information
available about each text, generally constitute the
state of the art in information retrieval at this time,
for the phase of the retrieval process extending between
the point at which the retrieval system has identified a
group of texts as being responsive to the search
criterion, and the point at which the user chooses and is
presented with the individual texts which he judges to
be actually germane to his needs.
In addition to other advantages, the present
invention solves the problems described above, by making
it possible for the user to see at a glance a break-down
~Trade Mark


1276728
of the -types of information contained in the texts
selected by his initial request. From the generated
display, the user can choose the texts which are rele-
vant to his true interest both easily and quickly.
The present invention also relates to a system
and method for identifying words in a target word list
which are similar to a source word, and/or for iden-
tifying phrases or sentences in a target population
which are similar to a source phrase or sentence.
Computer programs are used in a number of con-
texts to obtain words which are "similar" to some given
source word, most notably in indexing and information
retrieval programs and in spelling checkers. In
indexing and retrieval programs, the purpose of such a
search for "similar" words is to provide a more
exhaustive list of terms related to the input word, such
as plurals or forms modified by prefixes or suffixes.
In the case of spelling checkers, the purpose is to be
able to make a suggestion as to the most likely word the
user had intended, once a word is encountered which does
not appear in the program's dictionary.
In the traditional and simplest solution to
this problem, most often used in indexing and retrieval
programs, the user specifies the exact nature of the
relationship between the source word and the words being
sought by means of "wild card" symbols, most typically
the '?' '*' characters. In this protocol, the user
instructs the program exactly which parts of a word he
is interested in matching, and in which parts other
characters may appear, the question mark '?' being used
to signify any individual character and the asterisk '*'
any sequence of characters. Thus, by way of example,
the user would ask for "law" if he intended to find
words like "laws", "lawyer" or "lawless". Or a search
for "analy?e" could he used to locate both the American
(with a 'z') and the British (with an 's') spellings of
the word.

_ " _

1276728

In the case of spelling checkers, a more
flexible approach is needed, since the user does not
usually know that he has made a spelling mistake, nor
does he know in advance the relationship between the way
he thinks a word is spelled and the way it is spelled in
fact. Most typically, spelling checkers locate
"similar" words by first restricting the search to words
beginning with the same letter as the misspelled words
and then use a list of common spelling and typographical
errors to find words which differ from the source word
only by these letters.
An alternate approach used by spelling checkers
is to convert the word to an approximate phonetic form,
and then search a dictionary of such phonetic words,
on the assumption that the user typically has a clearer
idea of how the word sounds than of how it is spelled.
This last approach is usually quite effective at
finding spelling errors, though it suffers from the
drawback of being unable to deal with typographical
mistakes. This technique is therefore quite commonly
combined with elements of the previously mentioned
appoach, in order to obtain a more comprehensive list
of possible words.
Some information retrieval programs use the
phonetic approach also: along with a regular index of
words (or of keywords) in their textbase, they create a
parallel index in which those same words are represented
phonetically. Search requests are then converted to
phonetic format and the attempt is made to locate the
search words' phonetic translation in the phonetic
index. An example of this is the COMPUMARK* system
which is used in searching for trademarks.
Regarding "similar sentences", the state-of-
the-art is more simply described. There are complex
systems which actually parse sentences into their com-
ponent parts of speech and analyze the semantic relation~
ships among those parts; however, the applicant is not
*Trade Mar~
~ 6 -
,~,. ~

~276~728
aware of any retrieval systems in which the sequencing
of the words in a search request (as distinguished
from the identity of the search-request words and the
specified logical relationships among them) is used
to influence the choice of the texts to be retrieved,
or the ordering or ranking of the texts once they are
found.

SUMMARY OF THE INVENTION
The analyzing and summarizing aspect of the
invention makes explicit the inherent relationships
among a group of texts with associated keyword descrip-
tions, by analyzing the keywords held in common by
subgroups of texts within the overall group. The inven-
tion comes into play once a group of texts has been
selected using standard search methodology -- at the
point at which the user would either have to make
further guesses as to how to narrow down his search cri-
terion, or would be presentea with a sequence of texts
that would then have to be "winnowed through."
The invention is a system and me~hod of ana-
lyzing and of presenting the informational content of
this group of texts, as a group. The user sees pre-
sented on a display medium (screen) the equivalent of an
annotated "TABEE OF CONTENTS," organized as a standard
outline or in some similarly graphic format, analyzing
that group of texts into major subject areas, sub-
categories, sub-sub-categories, etc. Each "TABLE OF
CONTENTS" outline is dynamically generated in response
to specific search requests, and constitutes a kind of
"birds'-eye view" of the contents of the textbase in
that subject area at that time. For the user looking
for a specific kind of information of which he has only
a general description (thatls the typical case), a
glance at the table of contents, a matter of seconds,
usually suffices to eliminate from consideration most of
the irrelevant material. Relevant sub-categories

-- 7
~ .

1~767~8

usually are immediately evident. If necessary, the user
can pick out for further analysis (in one implementation
just by moving the cursor on the screen) a much-reduced
group of texts (one of the categories presented to him
in the table of contents) and repeat the analysis pro-
cess, creating another table of contents, this time of
the sub-category. One or two iterations will usually
suffice, even when starting with a group of hundreds of
texts, to ~et to a table of contents in which most
descriptions will be of individual specific texts rather
than of groups of texts. With an appropriate command;
e.g., by moving a cursor and pressing a key, the user
chooses the specific text or texts he wants to see
according to the descriptions he sees on the table of
contents, and with an appropriate command; e.g., a
keystroke, brings those texts to the screen or sends
them to be printed.
It is anticipated that this technique will
lead to an extension of the use of information retrieval
into areas where it had not been convenient or practical
to use it before, some examples being as follows:
(a) Specific information can be located much
more rapidly than had been possible using prior tech-
nology. Experience so far has shown that the time
needed for finding specific information in a large text-
base is on the order of 10~ of what it normally would
take, and the process is far more agreeable.
(b) It is practical to search for information
whenever the user knows some general characteristics of
what he's looking for, even though he has no idea how
that information may have been specifically described in
the textbase.
(c) It is practical to maintain and use large
heterogeneous collections of textual information, and to
do searches to find specific elements of that infor-
mation, without limiting authors to a predetermined
lexicon of "keywords." This means that users (e.g , in
_ 8

lZ767;~8

a corporate environment) can choose keywords spon-
taneously and still succeed in finding relevant infor-
mation among each others' entries.
(d) "Browsing" the textbase becomes a
pleasant and meaningful operation, quite different from
paging through texts or reading the thesaurus, which are
the only "browsing" techniques available in traditional
technology.
In one implementation of the invention, the
information retrieval system is coupled with a word-
processor, for convenience in entering texts into the
textbase, and with an output screen presenting the
results of the above analysis in traditional outline
format. The output screen shows the categories and sub-
categories of subjects found to be included in the texts
selected as a result of the original search request, to
any desired level of detail. The user moves the screen
cursor to point to a category on the outline for which
he wants a more detailed break-down, and the process
continues until individual texts are being referenced on
the outline. Then by pressing a key the user can direct
the system to send the chosen text to the printer or
bring it to the screen.
The significance of the invention is that the
amount of time needed for the user to isolate texts of
interest to him, from among groups of texts which
satisfy his initial search request but are in fact irre-
levant to him, is reduced by a large factor. The infor-
mation retrieval process is made more convenient
thereby; it is practical using this system to find
specific texts with only the most minimal initial infor-
mation as to how they may have been keyworded; and
various practical constraints which have restricted the
ways in which textbases needed to be organized in order
to guarantee that stored information could be found
again, can be relaxed.

1~76728

The invention also relates to a pr~cess which
enables the computer to locate `'similar" words in a
manner more flexible and more exhaustive than any
currently used technique, so far as we know. In par-
ticular, the invention does not require any specifica-
tion by the user as to the relationship between the
input word and the target words, nor does it rely on
phonetic translation or any restrictive list of typical
mistakes. The invention rather makes use of the actual
structure of the word itself, and searches for words
which have a similar structure or which include a similar
structure as part of a larger structure. The invention
is therefore able to locate a far more comprehensive
list of "similar" words than is the case with other
techniques.
The structure of the input word is analyzed in
terms of groups of letters, starting with letter pairs
and working up to larger groups, and accords to any word
in the target dictionary which contains these letter
groupings a number of points determined by the size of
the group and/or its location in the word. Words which
are given a large number of points by the process are
then presented to the user, in descending order of the
number of points allocated, for his selection.
In the case of searching for similar sentences
rather than similar words, the technique is identical,
except that groups of words, rather than groups of let-
ters, are compared. One field of application for this
invention is in information retrieval systems, where the
user presents his search request in the form of a phrase
or sentence, and texts are selected from the data-base
and/or prioritized, according to the scores achieved
when either their descriptions (keywords, title,
abstract) or the texts themselves are evaluated
according to this method. Since in information
retrieval systems the typical search request finds many
texts which are, in fact, irrelevant to the user, the

-- 10

~ ~76q28

invention, when employed to automatically winnow and/or
prioritize texts can save time and trouble for users of
the system.
These and various other advantages and
features of novelty which characterize the invention are
pointed out with particularity in the claims annexed
hereto and forming a part hereof. However, for a better
understanding of the invention, i. advantages, and
objects obtained by its use, reference should be made to
the drawings which form a further. part hereof, and to
the accompanying descriptive matter, in which there is
illustrated and described a preferred embodiment of the
invention.

BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings, in which like reference
numerals and letters indicate corresponding parts
throughout the several views,
Figure 1 is a schematic view illustrating a
stand-alone computer system wherein the present inven-
tion might be utilized;
Figure 2 is a schematic view illustrating acomputer terminal system interconnected to a remote host
computer system, the present invention being implemented
in either or both computer systems;
Figures 3A-G are schematic views illustrating
file structures of an embodiment of an information
retrieval system and method in accordance with the prin-
ciples of the present invention;
Figure 4 is a schematic illustrating the addi-
tion of text to a textbase in accordance with the prin-
ciples of the present invention;
Figure 5 is a schematic view illustrating
searching the textbase of Figure 2 in accordance with
the principles of the present invention;
Figure 6 is a schematic view illustrating
locating of texts which match the search request;

~276728
Figur~ 7 is a schematic view illustrating ana-
lyzing the texts found in the search;
Figure 8 is a schematic view illustrating pre-
sentation of the analysis to a user.
Figure 9 is a view illustrating a sample pre-
sentation of results at a display media;
Figure 10 is a schematic view of the process
shown in Figure 7, but illustrating various additional
features of analysis;
Figure 11 is a schematic view illustrating
automatic keyword modification to groups of text;
Figure 12 is a schematic view illustrating the
preparation of the search request prior to searching the
textbase;
Figures 13A-B are logic flow designs of an
embodiment of the present invention providing the abi-
lity to search for similar words;
Figure 14 is a schematic view illustrating an
embodiment of the present invention for calculating the
degree of similarity between two words;
Figure 15 is a schematic view illustrating the
calculation of the-similarity between two phrases or
sentences or collections of words; and
Figure 16 is a schematic view illustrating the
calculation of point scores used in calculating the
similarity between two phrases or sentences or collec-
tions of words.

DETAILED DESCRIPTION OF A PREFERRED
-
EMBODIMENT OF T~E PRESENT INVENTION
One implementation of the invention is a
program written in the C language, with some sections
written in Assembly language. The implemention to be
described runs on the IBM-PC and compatible micro-
computers. As generally illustrated in Figures 1 and 2,
the program is in principle easily transportable to a
large family of micro, mini, and main-frame computers,
- 12 -

., ~

~X767X8

and can be used in a multi-user environment. For
example, as illustrated in Figure 1, the program is
loaded into the memory of a computer system 30 powered
by a suitable power supply 31. The computer system 30
will include a user input device 32 such as a keyboard
and/or mouse. In addition, the computer system will
preferably include a storage device 33 for storage of
text material. A printer 34 will be provided for hard
copy printout of results and a display terminal 35 will
be provided for display of the program analysis at the
display terminal. As illustrated in ~igure 2, the com-
puter system 30 might be interconnected to a host com-
puter system 36 by any number of different methods such
as by telephone lines, a direct connect via a serial
interface cable, a radio frequency (RF) interconnection,
etc. The program of the present invention might be uti-
lized in the computer system 30 and/or the host computer
36. In a multiuser environment, the program might be
utilized from a dumb terminal 37 interconnected to the
host computer 36.

File Structure
The file structure of an embodiment of a
program in accordance with the principles of the present
invention will now be described.
As illustrated in Figures 3A-G, the textual
information itself and the indexing information
necessary to access it are kept in seven data files.
A text file 42 contains variable-size records
42a of the texts which have been saved in the textbase,
there being a record for each text in the textbase.
This information is ordinarily all kept in one file,
though the possiblity exists of splitting it into
several smaller files if the physical limitations of the
computer system being used prevent a single file of
large enough size being maintained.

~2767X8

~ text pointer file 44 contains information as
to where each individual text is located within the text
file 42 itself. Space in the text file 42 is allocated
as it becomes available (by old texts being deleted or
updated); an ordered list is therefore necessary in
order to locate the desired text at any time. Each
record 44a of the text pointer file 44 includes a Text
Number field a signing a unique number to each of the
texts, a Location field specifying the text location, a
Size field specifying the size of the text, and a Date
field for providing information as to the date each text
was last modified, for use when searching for texts
which meet specific date criteria.
A keyword file 46 contains variable-size
records 46a listing every keyword which has been defined
in the textbase. The keywords may have been defined in
several different ways; for example, the author may
define the keywords as the text is entered, the keywords
may be defined automatically through the use of an
automatic keywording feature as described below, text
down-loaded from a commercial data base may have
keywords already predefined, etc. A keynumber is allo-
cated to each keyword on the basis of its position in
the keyword file 46.
An index of which texts contain which keywords
i5 kept in text keyword file 48. This file contains a
variable-size record 48a for each text in the textbase,
the entries being in the form of the number of the text
being referred to, followed by a list of the keynumbers
of the keywords associated with that text, followed by
an end marker to indicate the end of that list and then
the entry for the next text.
A text index file 50 includes a record SOa for
each text providing an index to the location and size of
the entry in tne text keyword file 48 for each text.
Free key file 52 and free text file 54 are
lists of available space in files 48 and 42, respectively,
- 14 -
, . .

~2'76728

so that spac~ isl those files can be reused as texts
are deleted or updated. The files 52 and 54 and their
associated records 52a,54a have a structure similar
to that of the text index file 50.

Adding Texts To The Textbase
Figure 4 illustrates the process by which
texts are created and saved in the textbase. Texts
might be created at 56 by a word processor function
associated with the program of the present invention, or
by "importing" texts from files which have been created
by other programs, such as by other word processor
programs or texts developed from a data base search
request. The user defines the keywords which he wants
to use to describe that text, either by marking them in
the text itself at 58 or by entering the keywords in a
separate keyword list at 60. Keywords which the user
has marked in the text are automatically scanned at 62
and added to the keyword list; both the text itself and
the keyword list are available for editing throughout
this process.
The same process is used to modify keywords
defined in a text which has previously been saved in the
textbase; the text is retrieved from the textbase in the
normal manner (see below), and is then available for
editing in the word processor.
When the user has finished entering or modifying
the text, he enters a command; e.g., presses a key,
to save it at 64. A number is allocated to the text at
66, based on the next available position in the text
pointer file 44. The keyword list for the text is
then converted to keynumbers at 68, either by finding
the existing keyword in the keyword file 46 or by adding
a new keyword to the file 46. The position of a keyword
in the keyword file 46 corresponds to the keyword number
assigned to that keyword. The text itself, together
with its keyword list, is added to the text file 42 and
_ 15 _
~;7

~767~8

the textnumber and list of keynumbers added to the text-
keyword file 48. The index files 44, 50, 52, and 54 are
then updated with the appropriate information.
In the case of saving a text which previously
existed in the textbase the process remains substan-
tially the same; it is saved using its previous text-
number rather than allocating a new number, and the old
information on keywords which describe that text is
deleted from the text keyword file 48 and replaced with
the new information.

Searching The Textbase
The procedure by which the user searches the
textbase to find a particular text or texts is
illustrated in Figure 5. The user initially enters his
search request at 72, in the form of the keyword or
keywords which describe the information he is looking
for. Boolean combinations or keywords may be used in
the description to logically describe the set of texts
which is being searched for. If the user has asked that
similar words or pre-defined "equivalent" words be
substituted into his search request, the substitution is
made at this time. (This process is described below and
in Figures 12 through 14.) The program then searches
the textbase at 74 to locate all texts which match the
search request, as is shown in further detail in Figure
6. At 76, the program analyzes the set of texts which
are found to satisfy the search request as shown in
Figures 7 and 10, and at 78, the program displays the
results of this analysis at the user's display terminal
(screen) as shown in Figure 8.
As illustrated in Figure 6, texts are selected
by scanning the text-keyword file 48 for each keyword in
the search request, and building a list of the texts
which match the request. This list is constructed ~y
taking from the search request each keyword in turn at
80, looking up its keynumber at 81 in the keyword
- 16 -

~ ~767Z8

file 46, and then scanning the text-keyword file 48 to
find all texts which contain that keynumber at 82. The
numbers of the texts are added to the list as they are
found at 82. This list is then combined at 83 with the
list of texts which had been found by previous itera-
tions of this process, which dealt with keywords men-
tioned in earlier parts of the keyword request. The
lists are combined according to the logical operation
specified by the user. At 84, the process is repeated,
the list produced by each successive iteration being
combined with the list created by all previous itera-
tions, until all the keywords in the search request have
been dealt with. At 85, the program checks if the user
has requested that the search be limited to texts
created (or modified) within certain date limits. If
the user has imposed no such date limits, at 86 the text
selection is terminated. If the user has requested date
limits, at 87 the listed texts are checked against dates
stored in the text pointer file 44 and only those texts
whose creation/modification dates fall within the limits
are retained.
The program then analyzes the set of texts
which has been found and presents the results of that
analysis The process by which the analysis is carried
out, and the manner in which the results are presented,
will now be described.

Analyzing The Texts Found In The Search
The program analyzes the set of texts which
has been found to match the initial search re~uest, by
means of the process shown in Figure 7. First, at 88
the program obtains the list of ~eynumbers associated
with each text in the set. These lists are obtained by
reading them from the text-keyword file 48. The lists
for each text are then scanned at 90 and the number of
texts in which each keynumber occurs is counted in order
to identify the "criterion key" -- the most fre~uently

~276728
occurring keynumber, i~e., the keyword which is associated
with the greatest number of texts in that set.
The set of texts is then divided into two subsets
at 92; the "right-group" containing all texts which
are described by the "criterion key", and the "down-
group" containing those texts which are not described by
the criterion key. The "right-group" is thus a list of
all the texts in the current set which include among
their keywords the "criterion key"; all remaining texts
from the current set are listed in the "down-group".
As will now be described, these two subsets
are then in turn analyzed by the same process of finding
the most commonly occurring keynumber and using it to
split the set of texts; the two sections of the program
at 90 and 92 being performed recursively until all the
texts have been analyzed, or until such time as a deci-
sion is reached not to continue analysis further in
either the "right" or the "down" direction.
If at 94, a decision is made to continue the
analysis further in the "right" direction, then note is
taken at 94a of the identity of the current group, as it
will be the "parent" group for the forthcoming recursive
iteration of the process. Then at 94b, the sub-group
which had been the "right-group" created at 92 is marked
as the new upcoming "current" group, and note is taken
at 94c that it was originally created as a "right-group"
at 92. The analysis routines now invoke themselves
recursively; that is, handling of the previous current
group (now the parent group) is interrupted and the
system begins the analysis of the new current group at
90. ~he full analysis is thus a set of nested pro-
cesses; for a group of text to be fully analyzed, the
analyzing routines first split the initial group into
two sub-groups, and then invoked themselves to handle
the further analysis of each of the resulting sub-
groups. Thus, the process proceeding to handle the new
current group at 90 and at 92 may again be interrupted
- 18 -

.~

~276728

at 94 to handle yet another right group produced at 92during this second iteration, and/or at 96 to handle
analysis of the "down-group" produced at 92 during this
second iteration. The procedure, if the iteration is
interrupted at 96, is similar to that described above at
94a, 94b, and 94c: note is taken at 96a of the identity
of the current group, which will be the parent group for
the upcoming iteration. Then at 96b the sub-group which
had been the "downgroup" created at 92 during the
current iteration is marked as the current group for the
upcoming iteration, note is taken at 96c that it had
originally been a "down-group" and it is processed starting
at 90.
During any iteration, if a decision at 94 is
not to analyze further right, and the decision at 96 is
not to analyze further down, then at 97 a check is made
whether the current group has a parent group, (since the
existence of the parent group means the existence of a
group whose processing had been interrupted at 94 or
96). At 97a, a check is made as to whether the current
group had originally been a "right-group" or a "down-
group", this being a way of identifying the point at
which processing of the parent group had been inter-
rupted. If the current group had been a "right-group",
its parent group (noted at 94a) is reidentified at 97b
as the current group, and its processing is taken up at
96. If the current group had not been a "right-group",
then it had been a "down-group". Its parent group noted
at 96a is reidentified as the current group at 97c and
processing of this reinstated current group continues at
97. ~hus, the process of analysis having been inter-
rupted potentially numerous times for the analysis of
sub-groups and sub sub-groups eventually completes all
the interrupted analyses until eventually a parent group
is reinstated as the current group, which was the origi-
nal group with which the whole analysis procedure was
begun. When th~ analysis of this group proceeds to 97,
- 19 -

~2767Z8

it will be founA to have no parent: group, and the analy-
sis procedure terminates.
In this way every sub-group of the original
group of texts is analyzed to the desired depth and a
"tree" built out of the original list of texts. This
tree is an analysis of the relationships among the
various texts in terms of the keywords which describe
them; it groups related texts together according to the
similarities in their subject matter and locates all the
texts in a structure of headings and sub-headings.
At each node of the tree, the list or node in
the "right" direction defines the texts which belong to
the largest category from the set of texts which was
input to the node, and the list or node in the "down"
direction defines those texts not included in that
largest category. Starting at the root node (the list
of texts generated by the user's original search
request) and reading down from node to node, provides a
Listing of the major categories into which the original
group of texts has been divided.
This listing is automatically sorted into "order
of importance" through the above procedure of selecting
the successive "criterion keys"; the larger the group
of texts described by any particular criterion key,
the closer it will be to the top of the list. The tree,
then, provides a break-down of the original list of
texts into its various subject matters, and can be
extended to any desired level of detail.
Control of the analysis 94, 96 is achieved
either under interactive user control or automatically
on the basis of the number of texts already found and
displayed. In automatic mode, analysis to the "right"
(that is, more detailed analysis of a group of texts
which are described by the "criterion key") is ter-
minated at 94 either when all the texts in the set have
been shown, or when the depth of analysis of that set is
such that further analysis would take up too much space,
- 20 -



~276728

making it impossible to show the "down-list" within the
limits which have been set for the number of lines of
analysis to display. Analysis "down" (those texts which
are not described by the current "criterion key") is
terminated at 96 either when all texts have been shown,
or on reaching a predetermined limit as to the number of
lines to show.
The user may control the analysis process by
setting in advance the number of display lines at which
he wishes automatic analysis to stop, or interactively
by at each stage in the process deciding whether to
further continue analysis either "right" or "down", and
how far to continue it in either direction. In addi-
tion, as specified below, the user may invoke various
additional features affecting the procedure of analysis
as generally illustrated in Figure 10.

Results Of The Analysis
As generally illustrated in Figure 8, the
results of the analysis procedure described above is
presented to the user as a screen display, indicating
the groups of texts which have been found and their
relationships to each other, in the form of a "table of
contents" of headings, sub-headings, and tests.
The process by which this table of contents is
created is illustrated in Figure 8. First, the descrip-
tion of the original search request is displayed on the
screen at 98. Then the first node, or "trunk" of the
tree, (being the information provided by the analysis of
the first group of texts to be analyzed at 90) is
referenced at 100 and the keywords which describe it
(the criterion key, and any other keywords which are
common to all the texts of the group) are put on the
screen at 102. The existence of any texts which that
node completely describes (that is, texts all of whose
~eywords have by now appeared on the screen) is then
indicated on the screen at 102 by using an arrow symbol

~ - 21 -

~76728

to represent ~hem. Non-printing codes which include the
text's text-number are embedded in the table of contents
at this point. These codes are used later, if the user
asks to see the text whose existence is indicated by the
arrow displayed If there is at this time no right node
(because the user, controlling the analysis interac-
tively, chose not to split the group, or because a pre-
set maximum depth of analysis had been reached), yet the
group does still contain texts which have not yet been
completely analyzed (i.e., texs some of whose keywords
have not yet appeared on the screen), then this fact is
indicated by showing the number of such texts in
brackets; e.g., "(8)". In this case, non-printing codes
are embedded in the table of contents giving the loca-
tion in memory of information about this node/subgroup,
including the list of texts belonging to it. This
information is used later if the user asks to "expand"
the analysis of this group's texts, or to perform some
other manipulation on the texts of this group.
At 104, the program then checks if there is a
"right-node" associated with the current node (such a
right node will have been produced by the analysis if
there is room to expand further to the right in the
outline, and if there are still texts with unexamined
keywords in the node). If such a "right-node" exists,
the count of how far to indent the next line on the
screen or printer is increased by one at 106.
At this point, the routine we are describing
invokes itself recursively. ~he handling of the
recursive process ~104a,b,c) parallels that described above
for 94a,b,c as the handling of 108a,b,c follows that of
96a,b,c. Indeed, the entire procedure described in
Figure 8 para~lels that described in Figure 7, with the
difference that Figure 7 describes the splitting of the
groups of text into sub-groups (at 90,92), and Figure 8
describes the display of information about each group at

" .
- 22 -

~7~76728

102 and controls the level of indentation of the display
lines at 106 and 110. Control of the return from recur-
sive iterations at 112, 112a,b,c parallels that
described above for 97, 97a,b,c.
Thus the transition from 106 to 100 in Figure 8
is a recursive invokation of the routine being described.
Without the routine having completed its activity, the
"right-node" is now designated as the current node to be
handled 104b (the node whose processing is interrupted
being referred to as the "parent~node") 104a, the level
of indentation on the display is incremented 106, and
the very same routine starts out "from the top" handling
the current node (which had been the right-node) as if
it were being invoked for the first time. Thus, the
routine described in Figure 8 invokes itself; while still
in the middle of handling the root node, it calls itself
to handle the right-node.
The new current node is then handled as described,
including the handling of its own right and down nodes,
until the process runs to completion at 112. At 112
there are unfinished nodes to be handled, at 112a this
node's parent node is seen to have been a right node,
at 112b it is reinstated as current node and its processing
continues at 108, which is just after the point at
which handling of the node had been interrupted in order
to handle its right node.
Next, the progam checks whether a "down-node"
exists at 108. If so, it is identified as the current
node (without changing the indentation), a processes
similar to the one just described is undertaken at 108a,b,c,
and the routine invokes itself again 100. Thus,
handling of the parent node is again suspended while the
down-node (now the current node in the new invokation)
is handled. When work on the down-node (which includes
work on any of its subordinate nodes) reaches 112 and
112a, the parent node is reinstated as current node at
112c, and the level of indentation used (at 102) in
- 23 -

767X8

creat-in~ ~isL)lay lin~s is reduced by one at llO. Since,
in the example we have keen running through, the node
which is now the current node was the original "root"
node, at 112 the display process terminates at 115.
Thus, processing of the root-node (the node
first supplied by the original text search) is
interrupted first to process the right-node, and then to
process the down-node. Each of those processes may in
turn be interrupted to process right-nodes and down-
nodes, each of which may in turn be interrupted, etc.
Each time that the processing of a given nodeterminates (when there is no further right-node and no
further down-node to be handled) the program checks at
112 if there are unfinished nodes to process. If such
nodes exist, at 112a control is returned to the parent
node from which the routine was invoked, and the processing
picks up where it left off. In the case of the root-
node, there is no parent-node, and the process ter-
minates at 115, the whole table of contents having been
displayed.
Illustrated in Figure 9 is an example of such
a screen display. The first-line is a heading indi-
cating the search request which created this analysis.
The remainder of the display represents, by showing the
successive criterion keys as headings, the results of
the analysis in the form of an originated "table of con-
tents" of the section of the textbase under analysis.
In this "table of contents", lines ending with
an arrow, such as line I.B.2, represent the presence of
a text which includes only the keywords shown in that
line and in the headings above it. In the case of this
example, a text has been associated with the keywords
"fruit", "oranges" and "jaffa". Analysis right on this
text has been completed. If there were more than one
text with these keywords, a series of right arrows would
be shown on the line, one for each text.

- 24 -

~76728

Lines in the table of contents with a number
shown in brackets, such as line I.A.l, indicate that
there are that number of texts including the keywords
shown in that line and in the headings above, as well as
other keywords, and those texts are not shown indivi-
dually in this analysis (i.e., analysis "right" has been
terminated at this level).
Line I.C. above shows that there are other
categories of texts not included in this table of con-
tents (i.e., analysis "down" has been terminated at this
point).

Using The Table of Contents
The user can either review the texts indicated
by the analysis, or ask for a further "expansion" of a
group of texts which have not been fully analyzed. The
user moves the cursor up and down on the screen to point
at the text or group of texts he is interested in; and
then presses a key to request that the text be displayed
by the word processor or that the group be expanded.
If a text is to be displayed, its number is
taken from the non-printing codes embedded in the table
of contents. That number corresponds to an entry in the
text pointer file 44, where the location of the text
itself (within the text file 42) is indicated. The text
is read from the text file and passed to the word pro-
cessor for reading, editing, or printing.
If a group's analysis is to be expanded, the
program refers to the non-printing codes embedded in
the table of contents to find the location in memory of
the list of texts and other information associated with
the group. The information is then passed to the analysis
and display routines previously described (Figure 7
and 8). This new analysis is presented in a new screen
display, to be used in the same way as the "parent"
analysis; the user can continue to "expand" any group until
- 25 -

1~767X8

he finds and loads the text he is searching for, or can
at any time return to a previous "parent" table of con-
tents to look at a different group of texts.
Further information might be provided to the
user by a special header which appears at the top of the
screen whenever he stops moving the cursor on the table
of contents; this header indicates the list of keywords
describing either the text or the group of texts which
that line represents. In addition, screen highlighting
might be used to indicate that the cursor is pointing at
a specific text, or to indicate all lines of the display
which are contained in the group referred to by the
cursor.

ADDITIONAL FEATURES
Figure 10 illustrates the basic text analysis
procedure of Figure 7 with additional features being
present for enabling the user to modify the basic text
analysis.

Keyword Manipulation During Analysis
It is possible to "hide" specified keywords so
that they are removed at 88a and do not appear in the
analysis at all, to "ignore" keywords at 90a so that
they are shown in the results of the analysis but are
never used to split the set of texts, or to declare cer-
tain keywords as "equal" to each other and substitute
therefor at 88b so that they are treated as identical
for purposes of the analysis.
Hiding keywords can be useful in cases where
some group of keywords, which would otherwise influence
the display, are irrelevant for a particular purpose at
hand. If the user has asked for words to be "hidden",
then at the time that the program obtains the lists of
keynumbers associated with each text at 88 by reading
them from the text-keyword file 48, the lists are compared
to the list of words to be hidden. Keynumbers found
- 26 -
~;
.~,._,

1276728

on the "hidden" list are simply skipped at 88a, not
included in the keyword lists which are subsequently
used for the analysis.
Making keywords "equal" to each other is useful
in cases where disparate categories are equivalent
with r~spect to a particular task at hand. One frequent
case, in particular, is that in which several different
keywords have been used (perhaps by different users of a
common textbase) to describe what is in fact the same
category of information. Words on the "equal" list are
arranged in groups of words which will be made "equal"
to each other-. If the user has asked for words to be
made "equal" to each other, then at the time that the
program obtains the list of keynumbers at 88 by reading
them from the text-keyword file 48, the "equals list" is
scanned with each keynumber read from the file. Whenever
a match is found, the keynumber which headed the group
of equivalent keys on the "equals list" is retained,
in place of the number which was actually read from
the file at 88b. Thus, the user has effectively
changed categories, and potentially combined categories,
for the purpose of the current analysis, although the
files themselves remain unchanged.
Words which the user has specified are to be
"ignored" are included in the lists unchanged, but such
a word is never allowed to become a "criterion key". If
the user has supplied a list of words to be "ignored",
then each time a criterion key is chosen at 90, the
chosen keynumber is compared to the list of numbers of
keys to be ignored at 90a. If the chosen key is found
to be on that list, then it is disallowed as a criterion
key at 90b, and the most popular key not on the "ignore"
list is chosen as criterion key in its stead. This
allows such words to appear on the display without
affecting the manner in which sub-groups are defined.


- 27 -

~76~Z8

User Supplied Criteria During Analysis
If the user is controlling the analysis
interactively and indicates at 88c a desire to specify
the criterion key directly, a list of all the keywords
associated with the texts in the current subgroup
(available from the procedure at 90) is placed on the
screen. The user enters his choice for the criterion
key at 90c, that word's keynumber is found by scanning
the keyword file 46 (the key number is the word's posi-
tion in that file), and that number becomes the cri-
terion key according to which the group of texts is
split into "right" and "down" subgroups at 92. This
input of a criterion key by the user at 90c then repla-
ces the counting operation at 90 and is used to split the
group into two groups at 92.
The user may, in fact, supply at 90c not just
one word, but a logical combination of keywords, thus
creating a "local" Boolean search request, which is then
! analyzed just as was the original main search request,
as described above and in Figure 6. The current group
of texts is then split into two groups, those satisfying
the "local" Boolean search request going to the "right"
group, those which do not satisfy it going to the "down"
group.

User Supplied Criteria In Outline Format
It is possible for the user to specify in
advance the criterion keys or local search requests
which are to be used for some or all iterations of the
analysis procedure (i.e., for splitting some or all of
the resultant sub-groups). The user provides this
specification in traditional outline format. The first
line of the outline becomes the "current line", and
provides the criterion (word or logical combination of
words) for the first analysis at 90d, which then
proceeds as described in the preceding paragraph at 92.
If the line following the current line on the outline is
- 28 -

~2767Z8

indented further to the right, then it will be used (and
become the current analysis, if any, is in turn analyzed.
The next line of the outline which has the same level of
indentationas the current line, if any, provides the
criterion for analysis (and becomes the "current line")
when the down-group created by the current analysis, if
any, is in turn analyzed. When the right and down
groups are in turn analyzed, the lines of the outline
which had been respectively selected are then treated as
the "current line", and further lines are identified for
use in analyzing the resultant new right and down
groups. Thus, input of a criterion key from a user-
supplied outline at 90d then replaces the counting
operation at 90 and is used to split the group into two
groups at 92. Any group being analyzed for which no
line of the outline has been designated to provide the
analysis criterion, is split in either the usual automa-
tic or the usual interactive manner, as described above.
Lines from the outline supplying criteria in
the analysis of groups of text at 90d are then repro-
duced as part of the display lines generated at 102.
The result of this procedure is that the user provides
an outline of his subject matter, and the system fills
that outline with references to whatever texts in his
textbase are relevant to each part and sub-part of the
outline.

Text-Scanning as Criterion During Analysis
An additional method of controlling the analysis
is to split the group according to the success or
failure of a scan for the presence or absence of words
(or pairs or groups of words with a specified degree of
contiguity) within the texts themselves (rather than
checking for words within the keyword list) at 90e.
This scanning operation at 90e then replaces the
counting operation at 90 and the results are used to
split the group into two groups at 92. This provides a

! 29
., .~

12767~8

facility for the use of what are called "full text
searching techniques" (information retrieval techniques
not based on designated keywords) in the context of a
retrieval system whose major functions are based on the
use of keywords. Thus, for example, one could search
for a designated pair of words occurring within a same
paragraph by scanning for the words within the
restricted group of texts which has been isolated at
some intermediate stage of the analysis.

Mathematical Calculations Used as Criterion
During Analysis
An additional method of controlling the
analysis is to split the group of texts into two groups
in a manner dependent on the results of a mathematical
calculation performed on a number or numbers found
either within the text or among the text's keywords at
90f. The numbers to be used are identified by having a
particular position on the text's keyword list, or by
having a particular position with respect to some
designated word found on the keyword list, or by having
a particular positional relationship to some designated
word found in the text itself. This numerical calcula-
tion at 90f, either on the data found in the keyword
list at 88 or on the data found by scanning the text at
90e, then replaces the counting operation at 90 and the
resu]ts are used to split the group into two groups at
92. As an example, a group of texts might be split into
a right-group and a down-group according to their suc-
cess in fulfilling the criterion "cost is greater than
100", where the number to be inspected is either
whatever keyword follows the keyword "cost" on the text's
keyword list, or whatever word follows the word "cost"
in the text itself.


- 30 -

1276728

Similarity Scoring as Criterion
Durinq Analysis
An additional method of controlling the analysis
is to inspect the sequence of the keywords for each
text (as read in from the text-keyword file 48 at 88)
and to split the group of texts into two groups
according to criteria dependent on the order of the
keywords at 90g. The process by which this prioritiza-
tion takes place is described below and in Figures 15
and 16. This process then replaces the counting opera-
tion at 90 and the results are used to split the group
into two groups at 92. One use of such an analysis is
to provide for prioritization of a group of texts
according to the degree of similarity between the set of
that text's keywords, treated as a phrase or sentence,
and the user's search request, treated as a phrase or
sentence, according to the method of measuring simi-
larity described below and in Figures 15 and 16. The
same method can be implemented in a comparison between
the search request and the texts themselves, or portions
of the texts, or non-keyword information (e.g., titles,
abstracts) associated with the texts.

Automatic Keywording
One embodiment of the present invention
enables users to allocate keywords automatically to
texts created by the word processor or texts imported
from outside sources.
The procedure by which keywords are allocated
automatically will now be described, referring generally
to Figure 11. The user designates a group of texts by
pointing to it on a table of contents at 200, as he
would if he were asking to further expand that group.
He then presses a key to indicate his desire to alter
the keywords of that group (which can, of course,
include the entire textbase). He supplies a list of
keywords to be added ("add words"), keywords to be
- 31 -

~276728

eliminated wherever found ("delete words"), and words to
become keywords if they are found within the texts them-
selves ("scan words") at 202. Among the scan words, the
symbol "**" is understood to mean that all keywords in
the keyword file 46, or all keywords in the keyword file
46 with the exception of a designated list of words, are
to be used as scan words.
The implementation then uses the normal text
handling routines to load the texts one by one at 204,
10 the keywords list is read, add words are added to it at
206 and delete words are eliminated from it if found at
208. The text is then scanned on a word by word basis,
the words of the text being compared to the "scan
words" at 210. When a match is found, the scan word is
added to the keyword list at 210. By user-selected
alternatives, when a match with the scan word is found
a) a preliminary check can be made to find whether some
designated additional scan word(s) are found within a
designated proximity, and only if so is the scan word(s)
20 added to the keyword list, and b) when a match with a
scan word is found, some other designated word can be
added to the keylist. When the scan is completed, if
any changes have been made at 212 in the keylist the
files are updated at 214 using normal text-saving proce-
dures. At 216, a che::k is made if there are any more
texts to be handled. If there are no more texts, the
routine terminates at 218, otherwise the next text is
handled starting at 204.
In the case of automatic keywording of an
30 "imported" file, the implementation reads that file from
the disk, creates a new empty keyword list for it, modi-
fies that keyword list as in the above paragraph, and
then saves it in the normal fashion.
The result of the "**" scan is particularly
noteworthy: using it, all texts in the textbase can be
keyworded retroactively, so that any mention made (in a
text) of a subject which subsequently becomes a

-- 32 --

lX76728

"category" (i.e. is keyworded somewhere in the textbase)
will subsequently be recognized and found when the sub-
ject is searched for. Similarly, the "**" scanning of
an imported text assures that any subjects mentioned in
the imported text which have already been keyworded
somewhere within the textbase, will be keyworded in this
imported text also~

Automatic Modification of Search Request by
Substitution of "Equivalent" Words
As illustrated in Figure 12, in the process of
the preparation of the search request in Figure 5, one
embodiment of the invention enables the user to cause
the search request he enters at 300 to be modified auto-
matically in several ways.
At 301, the program inspects the search
request to determine whether the user has indicated (by
means of an appropriate symbol) that "equivalent" words
or combinations of words may have been previously
defined to the system. (This technique is useful both
to permit a single word to represent a complex and oft-
repeated search request, and to provide the means for
providing automatic "equivalence" between the habitual
keyword vocabularies of different users of a common
textbase system.)
If a word in the search request is proceeded
by such a symbol (in this implementation a dollar sign
was used), then the word (including the proceeding
dollar sign) is searched in the keyword file 46. If
found at 302, the text keyword file 48 is scanned to
locate the associated text. The text, if one is found,
is not displayed through the normal display process,
rather the entire text is taken to be a redefinition of
the word which had been preceeded by the dol]ar sign,
and is substituted for it in the user's search request
at 303. The search request is then again made available
to the user for further editing at 300 or for him to
_ 33 _

~276728

repeat his command that the search request be processed.

Automatic Modification of Search Request by
Substitution of Similar Words
In similar manner, another symbol, the
asterisk, is employed when the user wishes his search
request to be expanded to include all the keywords in
the system which are similar to words in his search
request. At 304 the program inspects the search request
to determine whether any of the search words begin or
end with an aQterisk. If so, the similarity checking
routines (described below and in Figures 13 and 14) are
invoked to find all the keywords in the keyword file 46
which are similar to the given word in the search
request. If a list of similar words is found at 305,
the words are separated by the word "or", the list is
enclosed in parentheses, and the whole is substituted
for the original word at 303 in the users search
request. Here too, the modified request is again pre-
sented to the user for further editing, or for his com-
mand to proceed with the processing the request 300.
Once all requested substitutions have takenplace, processing proceeds to the locating of texts
matching the search request description.
Illustrated in-Figures 13A-B is a process in
accordance with the principles of the present invention,
which enable words in a target word list to be identi-
fied which are similar to the key words. Indeed, as
previously discussed, this aspect of the invention will
have application in several other uses, such as spelling
checkers and the like. In one implementation, the
invention is part of an information retrieval system,
where it is used to find key words related to words from
a search request provided by the user (whether similar
words, or misspellings of the same word).


- 34 -

1276728

The source word, i.e., the word to which we
intend to find similar words, is inpu~ to the process at
342. If it has a suffix, a second copy of the word is
made without the suffix 344,346 and this new version of
the word is kept for later use.
The program now fetches a word from the target
dictionary at 348, and the original word (including suf-
fix) is compared to the target word at 350 by the pro-
cess described below and in Figure 14. If that
comparison yields a score of zero or less than zero at
352, the program then checks at 372 if there is another
word in the target dictionary to look at, and if there
is such a word, fetches it at 348 and continues the com-
parison process.
If the first comparison results in a score of
more than zero, the program now compares the target word
with the source word at 354. The comparison is repeated
here because the process is essentially asymmetric - the
i first comparison at 350 checked whether the letter
groupings of the source word are to be found in the
target word; the second comparison checks whether the
letter groupings of the target word are to be found in
the source word.
The scores resulting from these two comparisons
are now added together and the total score examined. If
the total is very low (below 200 out of a possible 2000
at this point), the comparison is abandoned and the
program continues to examine the next target word at
348-352. If the total is very high (above 1900 out of
2000), the next stage in the comparison process is
bypassed as being unnecessary and the program continues
directly to calculating the average score at 366.
Alternately, at 358,360 if the total score falls
somewhere between the above cut-off values, the target
word is examined to see if it has a suffix; if so, the
suffix is removed. If either the source of the target
had a suffix, at 362 the suffix-less copy of the source
_ 35 -

1276728

word is now compared with the suffix-less copy of the
target word at 364 and the score resulting from this
comparison added to the total score. An average is now
calculated at 366 for all the comparison which have
been carried out, and if 368 that average score is above
a set threshhold (400 out of a possible score-of 1,000),
at 368,370 the word is added to the list of similar
words found.
At 372, the program now checks whether there
are any more target words to which the source word
should be compared; once the whole target dictionary has
been scanned in this way, the list of similar words is
sorted into descending numerical order at 374 and the
list of words (cut off at some convenient threshhold)
is returned to the user at 376 for further editing and/or
use at 300.
Three comparisons are thus made: the first
comparing the source word to the target, then comparing
the target to the source, and finally comparing the
source to the target where both words have had their
suffixes removed Depending on the scores reached at
each stage in the process, further comparisons are
halted if a very low score is received, or bypassed in
the case of a very high score. (Halting or bypassing
comparisons results in very quick processing in the case
of clear-cut similarities or differences, and further
comparisons are made only if justified by apparent simi-
larities between the words.) The resulting average
score is a balanced total of the three different com-
parisons. The threshholds set at each stage in the analysisare quite arbitrary and may be set to various values
depending on the needs of a particular application.
Likewise, the different comparisons can be weighted
if it is desired to emphasize one comparison over the
other, and the words can be examined with prefixes
removed as well as, or instead of, removing the suffixes.

- 36 -
,~

1276728

The comparison routine which carries out the
comparisons and allocates the points used in calculating
the scores is described in Figure 14.
The first letter is taken from the source
word, and from the target word at 378,380. At 382,
these letters are compared to each other. If the letters
do not match, then at 392 the program keeps trying
to find a match by taking one more letter at a time from
the target word until there are no more target letters.
The next source letter is then taken, and compared to
each target letter in turn, and so on until there are no
more source letters to compare.
Once a match is found between the source and
target letters, that match is counted at 386, and the
following two letters from each word are compared at
388. The process repeats until two letters are found
which do not match each other.
The comparison score is then calculated at
390, based on the number of consecutive letters which
matched in the two words. This score is obtained from a
table which converts the number of matching letters to
the appropriate score value. The start of this table is
shown below:

number of matching letters 1 ¦2 ¦ 3 ¦ 4 ¦5 ¦ 6 ¦ 7 . . .
point score 0 ¦1 ¦3 ¦6 ¦10 ¦ 15¦ 21. . .

These values are calculated on the basis of
one point for each subgroup (pair, triple, quadruple,
etc.) of letters contained in the group of matching
letters. A pair contains only one pair, and so is allo-
cated 1 point; a triple contains one triple and twopairs, and so is given 3 points; a quadruple contains
three pairs, two triples and one quadruple and so is
allocated 6 points. Different weighting systems can
easily be implemented depending on the particular needs
of a given implementation, to emphasize different sizes
~7
- 37 -

~76~728

of letter groupings.
This process is continued until the entire
source word has been scanned at 394. Once the word has
been scanned letter by letter in this way, a total
weighted score is calculated at 396. This total is
obtained by adding together all the subscores generated
during the comparison, and dividing them by the total
score possible based on the length of the source word.
This highest possible total is simply the value found in
the scoring table for the length of the source word
itself, as this is the value that would have been found
by comparing the source word with itself. In this way,
the score is adjusted for the length of the word so that
the same score will be obtained for words of comparable
similarity, no matter their lengths. This final score is
multiplied by 1000 in order to convert it to an integer
value between 0 and 1000.
The suffixes which this implementation looks
for and removes at 346,360 for the final comparison are
"ing", "ed", "er", "e", "tion", "al", "s", and "ly".
These suffixes are removed so that gram-
matically related words (verb forms, adjectives,
plurals, etc.) will be found by the program; a different
list or a list of prefixes could easily be substituted
to suit the needs of a different implementation.

Similar Sentences
As identical technique may be used to compare
word phrases or sentences rather than individual words.
This procedure is detailed in Figure 15. An input sen-
tence is obtained at 431, and stripped of any wordswhich appear on a list of "noise words" at 432,433. A
target sentence is obtained at 434 from a list of sen-
tences describing the texts in the database, and the
original source sentence is compared with this target
sentence at 435, using the method described below and
illustrated in Figure 16. Depending on the score thus
- 38 -
~;

1276728

obtained at 436 the program may then continue to compare
the target sentence with the source sentence at 437 and
depending on the average score so far obtained at 438
will then strip the target sentence of any words
appearing on the "noise list" at 439,440. If either
source or target sentence had words appearing on this
"noise list" at 441, the source sentence is compared a
third time to the target sentence, this time where
neither sentence includes "noise words" 442 The total
scores of all these comparisons is now averaged at 443,
and ir the average score lies above a predetermined
threshold at 444, the target sentence is added to the
list of similar sentences found at 445. This procedure
is repeated until no more sentences exist in the list of
target sentences being examined at 446; at this stage
the list of similar sentences found is sorted at 447.
At 448, the sorted list is either returned to the user,
or is used by the program to control the selection
and/or display of texts. In the implementation
described above, the list is used at 90g, where a group
text is divided into two sub-groups depending on whether
each text's comparison score falls above or below a
given threshold.
The mechanism of the sentence comparisons
carried out is detailed in Figure 16. Here a word of
the source sentence at 449 is compared to successive
words of the target sentence at 450,452 until matching
words are found at 451. Once a matching word is found,
the number of successive matching words in the two sen-
tences is counted 453,454 and a score calculated, based
on the number of matching words 455, using a table as
was described above for the calculation of scores at
390. This process is repeated until all words of the
source sentence have been processed 456, at which point
a total weighted score is calculated at 457 based on the
total score achieved as a proportion of the total

- 39 -

~X~76728

possible score for an identical match.
It should be noted that the "sentences"
referred to in the paragraphs above may be, but are not
necessarily, grammatical natural language sentences.
The procedure is also applied to "sentences" which are
actually the keyword list provided by the user when he
describes the text on saving it in the textbase.
Moreover, the request might be a collection of words in
a predetermined order having no sentence structure. The
program will then search for this collection of words
appearing in the specified order within an area of the
text. The area of the text might be limited to a prede-
termined sub-area of the text such as the title,
abstract, paragraph, etc. or within a certain number of
words. This feature enables the program to distinguish
between areas of text having the same words but an
entirely different meaning.
It is to be understood that even though the
above numerous characteristics and advantages of the
invention have been set forth in the foregoing descrip-
tion, together with details of the structure and the
function of the invention, the disclosure is illustra-
tive only, and changes may be made in detail, within the
principle of the invention, to the full extend indicated
by the broad general meaning of the terms in which the
appended claims are expressed.




- 40 -

~ .

Representative Drawing
A single figure which represents the drawing illustrating the invention.
Administrative Status

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer , as well as the definitions for Patent , Administrative Status , Maintenance Fee  and Payment History  should be consulted.

Administrative Status

Title Date
Forecasted Issue Date 1990-11-20
(22) Filed 1987-12-04
(45) Issued 1990-11-20
Deemed Expired 1993-05-22

Abandonment History

There is no abandonment history.

Payment History

Fee Type Anniversary Year Due Date Amount Paid Paid Date
Application Fee $0.00 1987-12-04
Registration of a document - section 124 $0.00 1988-10-14
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
KLEINBERGER, PAUL J.
TNET, INC.
Past Owners on Record
None
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents

To view selected files, please enter reCAPTCHA code :



To view images, click a link in the Document Description column. To download the documents, select one or more checkboxes in the first column and then click the "Download Selected in PDF format (Zip Archive)" or the "Download Selected as Single PDF" button.

List of published and non-published patent-specific documents on the CPD .

If you have any difficulty accessing content, you can call the Client Service Centre at 1-866-997-1936 or send them an e-mail at CIPO Client Service Centre.


Document
Description 
Date
(yyyy-mm-dd) 
Number of pages   Size of Image (KB) 
Representative Drawing 2002-03-11 1 7
Drawings 1993-10-14 15 296
Claims 1993-10-14 6 216
Abstract 1993-10-14 1 27
Cover Page 1993-10-14 1 10
Description 1993-10-14 40 1,677