Note : Les descriptions sont présentées dans la langue officielle dans laquelle elles ont été soumises.
~3~
-
METHOD AND APPARATUS FOR CLASSIFYING
RAW DATA ENl~IES ACCORDlNG TO DATA PATTERNS
BA(~K(~R()UND OF T~IF INVFl~I~ON
The present invention relates geneMlly to a system for classifying large
volumes of raw data and, in particular, to a method and appaMtus for creating anindex to optimally classify a pluMlity of data entries for efficient review,
inspection, stoMge, sorting, e and reporting.
Many enterprises geneMte large volumes of data entries, each such entry
having associated with it one or more items Ic~l ~, important i,.r(",.,-,;...~
Typically, larger enterprises are ~: 1 ' and the data entries are inputted into
a computer and then transferred to a mass storage device such as a magnetic tapefor later use in various data analysis and report geneMting progMms. For example,
15 a receiving department or a stock room accepts a delivery of parts to be placed into
inventory. At least the part number and quantity are items which must be inputted
into the computer for later use in inventory"~-, r~..l.,.;~g and accounting
programs.
The prior art is v, ' Indexes are described as primary vs.
20 secondary; as dense vs. sparse; as single-level vs. multi-level; and, as static vs.
dynamic. Regardless of the method of index, the target of the index reference isone specific item. That one specific item may be a specific record, the first
occurrence of a specific value of a given field, the disk sector on which the data
is found, or something else of a singular definition. The target is usually defined
25 on (directed to the index from) a single field of a file. "F~ of Database
Systems" by Elasri and Navathe devotes chapter 5 to index structures for files.
The U.S. Patent No. 5,212,639 shows a method and electronic apparatus for
the ~ ri. .1,~". of . ' ' data for the and/or tabulation
thereof. An apparatus and a method for creating a database wherein tne data entry,
30 such as a journal entry in a l~, is the canonical record can be imr'~ --
using a ~ o, . A plurality of data entries to be classified are separate
records each comprised of one or more items having associated quantities and an
~ 8~4
.
entry number serving as a pointer to the record. Each item contains ;"r.."..";...~
including at least an item number or label and a quantity. A mapping function isapplied to each data entry to assign item indicators for the item numbers based
upon the associated quantities. The item indicators for the data entry are sorted
5 into ascending numerical sequence and an n-dimension sparse matrix is selected where "n" is the number of items in the data entry. If the present ' of
item indicators is new, a design record is created for the database based upon the
sparse matrix and including the item indicators, the associated quantity sums, the
total number of data entries ' in the design record and a pointer (a chain
o of entry numbers) to the records of the data entry detail. The quantities for the
presen~ data entry are added to the quantity sums and the entry number is storedin the pointer chain. After all the data entries have been processed, a search
routine can be utilized to review the various design records as desired.
The U.S. Patent No. 5,390,113 shows a method and electronic apparatus for
15 pclfu~ , bol-kk~~ A method and apparatus for el~ ly p, r ~ ~
kk..~ f upon a plurality of accounting joumal entries having at least one
account number and at least one data component associated with each account
number. A chart of accounts having account numbers and opening balances
associated with the plurality of joumal entries is read electronically. A set ofaccount-section numbers is then created for each account number. The joumal
entries are el~L1, lly read and one of the account-section numbers is assigned
to each account number. The assigned account-section numbers and associated data., are then sorted in a l)1cd~; ' order. A design for the
order is identified and compared with stored design records to see
if such a design already exists. If not, the new design is stored. If so, the
associated data ~ are added to the - ' ~ total for each account-
section number. A tally ICLn~ illg the number of account-section numbers is
increased by one and an entry number is added to a list for the particular design
record. The process is then repeated for each joumal entry.
The use of a database to summarize joumal entries in a way that can be used
for audit planning and ?~ data for matrix-defined financial statements is
2~8~
described in an article entitled "An Innovation in Auditing: Matrix Summaries ofJournal Entries", ABACIJS, Vol. 2g, No. 2, 1992.
SUMMARY OF THE INVENTION
An index classifies material in such a way as to make it easier to find. The
present invention relates to methods of creating an index for the storage and
retrieval of electronic data. The present invention differs from the prior art
indexing systems in that the source and the target is a group of items rather than
o an individual item, and that the data pattern of the group defines an individual
record of the index. Under these conditions the index records are of variable
length (multiple fields) rather than a single field.
The present invention concerns an apparatus and a method for creating a
database wherein the data entry is the canonical record. Raw data, typically
15 provided from a mass storage device, is formed of a plurality of data entry records.
Each record can include one or more key fields each containing and item of interest
to be indexed. Typically, two or more of these data entry records are related asa group and these groups are to be classified or ' A master list of all
of the items in the key fields to be found in the raw data is either included with the
20 raw data or created by the method and apparatus according to the present invention.
The present invention can be used to index new records and to . ,'
existing files. Use to index new records includes (I) the ~l~ccifi~fion of data for
schedules or quality reports and (2) ", ~ of a "class" of items in object-
oriented l~lUL ~ ' ' usage would include use as primary or
25 secondary indexes to (1) a log file and (2) a relational database table. New record
usage involves dynamic operation whereby the index becomes part of the process
of selecting where to place newly created data.
Another feature of the index according to the present invention is
of program branching. A "Pl" pointer located in each index record
30 allows the ~ , to specify program action for given patterns of data using
additional i~ liùAs. This feature is expected to simplify some programs.
~183~
BRTF.F DFA~GRTPTION OF TT-TF DRAWINGS
The above, as well as other advantages of the present invention, will become
readily apparent to those skilled in the art from the following detailed description
s of a preferred ~,I..I)o~' ' when considered in the light of the - . Y;l 6
drawings in which:
Fig. 1 is a schematic perspective view of a three-~ Pnc;~-n~l matrix
according to the present invention;
Fig. 2 is a schematic ~ lio-l of the data pattern structure schema
according to the present invention;
Fig. 3 is a block diagram of the apparatus according to the present
invention; and
Fig. 4 is a flow diagram of the method according to the present invention.
DFA~(~RTPTION OF TTTF pRFFFRRFn FMPODTMF,~T
In general, raw data in the form of a plurality of data entry records is only
useful if the records can be sorted based upon items of interest in the records. A
significant savings in data processing time can be achieved if the data entry records
can be indexed or classified by "key" items and such index stored for use in
subsequent searches, summaries and reports. A "group" can be defined as one or
more data entry records having a common 1~ 'a" ' ,i. A group will have one or
more key fields in which items of interest (key items) are stored. For example, a
simple sales transaction can be recorded utilizing a first data entry record for the
2s customer ;,~ . r " 1~ a second data entry record for the product sold, and a third
data entry record for the sales price. These three data entry records are related as
a group. If the customer ' ~ is the only item of interest, this group has
one key feld (in the first record). If the product sold also is of interest, this group
has two key fields (in the first and second records).
A group having n-key fields can be indexed according to the present
invention by utilizing an n-dimension matrix with up to Mn values stored where
"M" is the number of items in a master list of key items for all data entry records
- 21$3~
in the raw data. However, not all key item ~u ' - may occur in any given
set of raw data. This allows the use of a series of sparse matrices, which require
space to be allocated for only those key items which actually exist. For the
purposes of i~ ctr~fi--n it is assumed that a plurality of groups of raw data entry
s records, each group having three key items of interest, are to be classified.
Potentially then, a three ," ' matrix, such as the one shown in the Fig. 1,
could be used to index such raw data.
There is shûwn in the Fig. 2 the data pattern structure schema of index
records according to the present invention for each of one-dimension, two-
o dimension, three-dimension and four-dimension ~ of key items to be
indexed. Each size (number of key fields) group of data entry records requires aseparate matrix. The index records include a variable length record pûrtiûn, having
a key field for each key item to be indexed in the associated group, and two pointer
fields which are discussed below.
An example of a three-dimension matrix 21 which uniquely classifies grûups
of data entry records each having a, ' of three key items from a master
list of ten items is shown in the Fig. 1. The matrix 21 can hold up to one thousand
values, or different ~ of the three items in the groups, which is the total
number of cells in the matrix. The items used in the example of the Fig. 1 are
20 designated by item indicators as the first ten letters of the English alphabet. The
first item of the group is classified on a "Y" axis 22, the second item is classified
on an "X" axis 23, and the third item is classified on a "Z" axis 24 of the matrix
21. Thus, a group of data entry records having key items "ADE" results in an
;l~f~ at a cell 25 and a group of data entry records having key items "JJA"
25 results in an : at a cell 26.
A dictionary is a type of index wherein only those; b ~dliùl~S of letters
of the alphabet which have meaning, words, are of interest. The dictionary indexes
those letter ~nC in a p.cd~l ' sequence. Thus, a dictionary is not
an efficient index if all words containing ~ el .' letter ~ "~ - such
30 as the letters "a" and "b" for example, are to be located. Clearly, an index can be
~o..~llu~t~ according to the present invention to efficiently r ' ' such
search requests.
s
~1~3~
Assuming that a master list of key items exists, the matrix design process
for each n-dimension ~ù ' involves the steps of: (a) reading a data entry
records group from the source of raw data, (b) ~ the data pattern
structure schema of the data entry records group in terms of the number of key
s fields to be classified, (c) creating an index record for the key item combination,
(d) mapping the key items to generate item indicators in key fields in the indexrecord, and (e) recording a pointer in the index record to enable the data entryrecords group to be located in the raw data. The result is a ~ ' length
index record for each unique data pattern in the groups of data entry records. After
the raw data has been processed in this manner, each data entry records group can
be located. Thus, the cells in the matrix shown in the Fig. 1 represent the
of item indicators for the index record key fields ~;UII~_r '- ~ to any
group of data entry records containing three key items from the master list of "A'
through "J".
When the sequential ' ' 1~ of items within the master list is fixed, the
item indicators substitute for the subscripts of matrix notation. The sparse matrix
IJIU3, ,, technique requires only those records which exist to be stored in
computer memory. Vacant positions or cells in the matrix shown in the Fig. 1
require no computer memory.
The index constructed according to the present invention includes index
records which identify the data patterns of the groups of key items drawn from amaster list. All that the process requires is that a master list be in computer
memory at the time the index is created. The form of master list file, whether
sequential, array, B+ tree, or some other Ul~ .~..;~l~iUII, is not important so long as
25 each master list item is uniquely identified via the mapping function in the index
records. rulL~ llul~ the master list can be constructed dynamically. One needs
primarily to avoid the problems associated with deleted master records in some file
structures. However, the choice of mapping function and file, " ~n of the
master list does affect the speed of operation of the process, the comparability of
30 index records over time, and whether the process can be traced in one or both directions. The index always identifies a group data pattern.
3~24
The group can be the sequential items of i as are found in a log
of invGices; or, the items reported in a summary, such as treatments recorded inhospital cases. The method according to the present invention requires only thatall items of the group be available at the time of data pattern creation.
s Data pattern is the term applied to the key items and/or the associated item
identifiers of all the data entry records of a group. For indexing purposes in
business or social science ~I~J~ the order in and/or the number of times
which the key items appear in the group of data entry records usually is not
important. Thus, in the above described sales transaction example, if we assign the
letters "A" and "B" to the key items of interest, it would not matter for the
purposes of indexing whether the order in the group was "AB" or "BA" and it
would not matter whether the group was "AAB" or ABB" since the same two key
items a~e present in each of these . ' Thus, the key items can be sorted
and duplicates eliminated before mapping so that synonyms are not separate, i.e.,
"AB" and "BA" will map to the same index record and be considered the same data
pattern. An individual item indicator (the mappcd value of the key field) will be
used only once in the c--~hinq~-n which becomes the index record even though therelated item appears more than once in the group of data entry records. In otherwords, duplicate O~,-,UII~ ,~ are combined. The ." of duplicate
0~ UI1~ reduces the size of the index and increases the speed with which the
raw data can be processed. If an item being mapped does not match any item of
the master list, an error message is generated before the data pattern is included in
the index
In business or social science, a data pattern is the verified, non-duplicate list
of item indicators of the co ' being mapped, typically in ascending
sequence. Other disciplines may require the index record to reflect the sequenceof the key fields as they appear in the group. Both types of index records may be
required for the same raw data. For example, the ~ of a bolt, washer
and nut can be key items of group of raw data entry records. For the purposes ofpurchasing and inventory, the order in which these items appear in the group is not
important. For the purposes of assembly, the order "bolt", "washer" and "nut"
would be important.
~83~
In the Fig. 3, there is shown an apparatus for "~.r, " the method of
creating the index records. A source of il~rullllatiu,. or raw data entry records 31,
such as another computer program or data i from another computer or
peripheral data source, has an output connected to an input of a computer 32. Tbe
5 source 31 may include a master list of items in key fields of the i " r,.. ., ~ . . being
processed. The computer 32 reads the master list and stores the master list
- r " in a master list memory devioe 33 connected to the computer. If the
master list is not present in the source 31, the computer must create the master list.
Once the master list has been stored, the computer 32 does not have to read the
master list each time additional raw data containing items from the master list is to
be processed.
The computer 32 reads the items in the key fields of a group of data entry
records from the data source 31. If the order in which the items appear in the key
fields of the group is not significamt, these key items are sorted into ascending
sequence and duplicates are removed. The computer 32 can identify the unique setof key items associated with a specific co".l. ~ If the set of key items is new,an index record containing the d~lU~)I' ' number of key fields is created. If the
set already exists, the computer performs the U~,LiU.~ provided for by a first
pointer in the l,UllUi~LJOl~ll;l.g index record. For example, the instruction could be
20 to store the current group in a file thereby - " all groups having the same
data pattern in a common file. In either case, the index records are stored in an
index record memory 34 connected to the computer 32. Each index record also
includes a second pointer which is used to locate the associated groups for
subsequent processing of the data entry records. An output device 35, such as a
25 printer or a CRT, also is connected to the computer 32 for generating output
l~rulll~ iu.. related to the index.
Data entry records already placed at locations deflned in each index record
may be inspected via a search routine. The se~rch may be according to all of thekey items of the l;UI '' " or any portion of the key items. Less than the
30 complete ~u ~ l requires additional queries to find the specific index record.
Once the proper index record is found, the operator may inspect data entry records
at the data en~ record locati~ ~ec _rec~.
211 ~2~
The hardware associated with this method of indexing is modest. It may
consist of portions of an existing computer program, as a separate computer system,
or as a small computer installed in the circuitry of a larger computer.
There is shown in the Fig. 4 the f~ow of the method of .;o.~ u~L..~, an index
s record. As each key field of the group of data entry records to be indexed
according to the data pattern is read, a mapping function is applied to assign the
item identifiers. An item identifier defines an item of the master list. If the
present data pattern is new, a new record for that pattern is created in the index.
After each data pattern has been processed to the index, the process can be
o interrupted. Thus, usage of the index may be either static or dynamic.
The process of creating and using the index records is:
A. Read the key fields of all data entry records groups of a given ~iu...l,:..aLon,
i.e. a transaction (as in business) or a case (as in hospital records), and tally the
total number of items in the ~ '
B. Determine whether key fields are to be sorted and duplicates removed. If
not, proceed to step E.
C. Sort the key fields read in step A. into ascending sequence.
D. Eliminate amy key fields of duplicate items and reduce the tally by the
number of duplicate values eliminated.
20 E. Create a variable lengtb portion of an index record where the length is the
number of index record key fields equal to the tally and apply the mapping function
to each item in the group key fields to store ~-~rri~crnn~' 1,, item indicators in the
index record key fields.
F. Determine whether the item indicators generated in step E. require a new
25 index record. If not, go to step G. If so, complete the index record by
hi.~ two pointers (P1) from the new index record to the computer location
of additional illol~ liOI~o to be processed when that particular c~ iS
d and (P2) from the new index record to the computer location of the
object of the index (location of new data storage, or location of a ~ ccifir~jon of
30 data).
G. Perform user-defined procedures to be used with the specific . ~illa~iO
as specified by the pointer Pl.
3~24
H. Wait for a signal to (1) apply the index to a new group, (2) begin a search
using the index, or (3) cease data entry record indexing and go to another program
function.
In the flow diagram of the method shown in the Fig. 4, the computer 32 of
5 the Fig. 3 is l.-u~ J to perform a sequence of i~DllU.,liu..~ beginning at a
circle 61. The computer program then enters a decision point 62 to determine
whether a master list of items to be indexed exists in the source 31. If the master
list exists, the program branches at "YES" to an instruction set 63 wherein the
master list is read and stored in the memory 33. If the master list does not exist,
the program branches at "NO" to an instruction set 64 wherein the master list iscreated and stored in the memory 33. The master list can created by reading and
storing the items in the key fields of all of the data entry records groups before
continuing with the program. In the alternative, the instruction set 64 can direct
the program to create the master list as the data entry records groups are being15 processed as described below.
The program then enters a wait loop 65 from both of the instruction sets 63
and 64. The programs idles in the wait loop 65 until receiving a signal that theraw data is ready to be processed. When an instruction set 66 receives the signal
that a group of data entry records from the source of raw data 31 is ready, the
20 computer program enters an instruction set 67 which causes the key fields of the
group to be read into computer memory and a tally of the number of key fields tobe created. If duplicate key fields are to be eliminated, a sort switch is set. If the
master list is being created, the items in the key fields are stored in the master list
memory 33. The program then enters a decision point 68. If the sort switch is set,
25 the program branches at "YES" to an instruction set 69 wherein the key fields are
sorted into ascending sequence, duplicate key fields are eliminated, and the tally of
key fields eliminated is subtracted from the tally created in instruction set 67.
The program then enters an instruction set 70 to create the variable length
portion of the index record (see key fields in the Fig. 2) where the length is the
30 tally of the number of key fields in the specific COIl~ la.~iull. A mapping function
is applied to each key field of the group to generate the item indicators in the key
fields of the index record. The instruction set 70 also is entered from the "NO"
3~24
branch of the decision point 68 when the sort switch is not set. If an error occurs
in the mapping function, a "YES" branch is made from a decision point 71 and an
indicator is set in an instruction set 72 for possible ~, The program then
returns to the wait loop 65. If no error occurs in the mapping function, the
s prograrn branches at "NO" from the decision point 71 to an instruction set 73
wherein the computer 32 searches the existing index records in the memory 34 to
determine whether the exact I ' of key items being processed already
exists. If the variable length portion of the index record being CU..Di1U~,lCd is new,
the prcgram branches from a decision point 74 at "YES" and a new index record
is created in an instruction set 75 by (a) extending pointer chains from each master
list item of the new record to include the new record, (b) ~b'- ' of a pointer
"P2" from the new index record to the group or file which is the object of that
record, and (c) the inclusion of a pointer "Pl" to any special i.,,~u.,liu." to be
associated with that index record. The program enters an instruction set 76 to
15 execute ill,llu~ilio.., referred to in the "Pl" field of the index rccord and then
continues to the wait loop 65. If the pattial index record under ~:oll~LlUU~iUII iS not
new, the program branches from the decision point 74 at "NO" to the instruction
set 76 and continues to the wait loop 65.
At any time the program is in the wait loop 65 a branch can be made to an
20 instruction set 77 to search the existing index records. After the search is
completed, the program returns to the wait loop 65. The program also has an
instruction set 78 which can respond to an indication that no more groups of data
entry records are to be indexed. The program then exits through a "NO" branch
of a decision point 79 to completion at a circle 80 or through a "YES" branch to25 an instruction set 81 if another computer program is to be executed.
The master list controls how the groups of data entry records are indexed.
More than one master list can be utilized with the same raw data entry records to
permit specialized searching and processing. The index records created from the
groups of data entry records reduce the time required for searching and organizing
30 the raw data.
In ac~u.~' with the provisions of the patent statutes, the present invention
has been described in what is considered to represent its preferred ~illlbu~'
11
2~
However, it should be noted that the invention can be practiced otherwise than as
specifically illustrated and described without departing from its spirit or scope.
12