Language selection

Search

Patent 1255809 Summary

Third-party information liability

Some of the information on this Web page has been provided by external sources. The Government of Canada is not responsible for the accuracy, reliability or currency of the information supplied by external sources. Users wishing to rely upon this information should consult directly with the source of the information. Content provided by external sources is not subject to official languages, privacy and accessibility requirements.

Claims and Abstract availability

Any discrepancies in the text and image of the Claims and Abstract are due to differing posting times. Text of the Claims and Abstract are posted:

  • At the time the application is open to public inspection;
  • At the time of issue of the patent (grant).
(12) Patent: (11) CA 1255809
(21) Application Number: 1255809
(54) English Title: SYMBOLIC TOKENIZER FOR WORDS AND PHRASES
(54) French Title: ABREVIATEUR SYMBOLIQUE DE MOTS ET DE PHRASES
Status: Term Expired - Post Grant
Bibliographic Data
Abstracts

English Abstract


SYMBOLIC TOKENIZER FOR WORDS AND PHRASES
ABSTRACT OF DISCLOSURE
A microprocessor capable of high speed generation of
tokens. The tokens are binary symbols which represent
language words and phrases. The tokens may be manipulated,
stored, or transmitted, instead of the language words or
phrases themselves, thereby realizing a greatly increased
efficiency due to the statistically reduced number of
bits required to represent individual language words.


Claims

Note: Claims are shown in the official language in which they were submitted.


THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE PROPERTY
OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:
1. A method of tokenization of trial words using a stored
dictionary of words, the stored words each having locations
relative to the others in the dictionary, the words in the
dictionary being grouped into segments with words of the same
length grouped in a common segment and words of different lengths
grouped in different segments, the words in each segment having
a plurality of portions making up the complete word, the method
comprising the steps of:
selecting a segment in the stored dictionary, the words
of which have a length with a predetermined relation to the
length of a given said trial word undergoing tokenization;
comparing bytes of one of the words in the selected
segment with bytes of said given trial word for a match; and
forming a token indicative of the relative location in
the dictionary of a word, compared and found to have the match
with the trial word.
2. The method of claim 1 wherein the segments are further
separated into groups, and each group has words each having a
portion bearing a predetermined relation to a corresponding
portion of the other words in the same group, and
wherein the step of selecting further comprises the
step of selecting a group of the words within the selected
segment for comparison, whose said portion bears such predeter-
mined relation to a corresponding portion of the given trial
word.
3. The method of claim 1 comprising the step of enabling
the step of comparison to be repeated when there is a lack of the
match with at least one of said bytes, comparing the same given
trial word and the bytes in a next one of the words in the stored
dictionary.
- Page 1 of Claims -
89

4. The method of claim 1 comprising the steps of:
enabling the step of comparison to be repeated, with
the bytes of other said words in the same said selected segment
until the match is found with all of the bytes in one of said
words in said segment under comparison; and
forming such token indicative of the relative location
of the word in the stored dictionary whose bytes are compared and
found to have the match with the given trial word.
5. The method of claim 1 wherein the step of forming a
token comprises the steps of forming a coded value having a
length and reducing such coded value to another coded value that
is shorter than the coded value.
6. The method of claim 5 wherein the step of forming the
token comprises the step of adjusting the value represented by
such another coded value by a predetermined amount.
7. The method of claim 5 wherein the step of reducing such
coded value comprises the step of combining a value representing
the number of segments in the selected segment with such coded
value.
8. The method of claim 7 wherein the step of combining
comprises the step of dividing the coded value by the number of
segments in the selected segment.
9. The method of claim 6 wherein the step of adjusting
comprises the step of adding an offset value as a function of the
relative location of the segment containing the matched word in
the dictionary.
10. The method of claim 3 wherein the step of forming the
token comprises the steps of forming an indication of the
location of the word in the dictionary that is found to match.
- Page 2 of Claims -

11. The method of claim 10 wherein the step of forming an
indication of the location comprises the step of incrementing the
indication to an indication of the location for a next one of the
words in the same segment of the stored dictionary when there is
a lack of the match.
12. The method of claim 4 comprising the step of forming
a value indicative of the number of locations for words in the
segment whose words are under comparison and terminating the step
of repeating in dependence on said value.
13. The method of claim 4 comprising the step of forming
a length value indicative of the length of the words in each of
said segments, and forming the value of said token as a function
of said length value.
14. The method of claim 1 comprising the step of substitut-
ing said token in place of at least one of said trial words in
a sequence of said trial words.
15. The method of claim 1 comprising the step of reading
out the stored words, the segments in a stored dictionary
contained in a read only memory for use in the step of comparing.
16. The method of claim 1 comprising the steps of providing
language words as the trial words and as the words in the stored
dictionary for use in the steps of selecting, comparing and
forming.
17. A method for compressing a series of words comprising
the steps of:
identifying a first parameter for one of the words;
identifying a second parameter for said one word;
using the first and second parameters for said one word
together to identify a memory field within a memory having
multiple fields containing stored words;
- Page 3 of Claims -
91

comparing stored words in the identified memory field
to said one word for a match; and
substituting for the matched word a token, identifying
the matched stored word.
18. The method of claim 17 wherein at least some of the
series of words and the stored words comprise representations of
language words and the step of comparing compares language words.
19. The method of claim 17 wherein the words comprise a
plurality of characters and the step of identifying a first
parameter comprises determining the location of the starting
character for said one word in the memory field.
20. The method of claim 17 wherein the words comprise a
plurality of characters and wherein the step of identifying a
second parameter comprises determining the number of characters
in said one word.
21. The method of claim 17 wherein the step of using the
first and second parameters comprises the step of determining a
starting address within the memory field and determining the
number of stored words in the determined memory field.
22. The method of claim 21 wherein the steps of determining
a starting address and determining the number of words comprise
the steps of using the first and second parameter to address and
read out, from a memory, a starting address and the number of
stored words.
23. The method of claim 17 wherein the step of comparing
words comprises the steps of:
loading said one word into a first register;
loading a first one of said stored words into a second
register;
- Page 4 of Claims -
92

comparing the contents of the registers for a match;
and
if there is a match, enabling the step of substituting
a token, and if there is no match, loading a second one of said
stored words into the second register and then repeating the step
of comparing for a match.
24. The method of claim 23 wherein the step of loading a
first one of said stored words and comparing the contents of the
registers are repeated and for each repetition another one of
said stored words is loaded into said second register and
compared until all of the stored words for the corresponding
first and second parameter combination have been compared, or
until the match exists, whichever occurs first.
25. The method of claim 17 wherein the step of comparing
words comprises the steps of:
loading said one word into the first register;
loading a portion of a stored word into a second
register;
the step of comparing comprising the step of comparing
the contents of a portion of the word in the first register with
the portion of the word in the second register for the match;
if the contents of the portion in the first register
and the contents of the second register match, then loading a
remaining portion of the stored word into the second register,
repeating the step of comparing using the remaining portion in
the second register, and,
repeating the steps of loading a remaining portion into
the second register and comparing using the remaining portion
until the entire one word has been compared to the entire stored
word.
26. A method for compressing a series of words comprising
the steps of:
- Page 5 of Claims -
93

comparing data representing a first word to representa-
tions of words stored in a first memory to determine whether a
token exists for a sequence of the words beginning with the first
data word;
if a token exists, in a second memory for a sequence
of the words, comparing data representing a sequence of the words
beginning with the first word to data representing a sequence of
words beginning with the first data word for a match; and
if the match exists, substituting a token identifying
the sequence of words from the second memory for the matching
sequence of words in the serial words.
27. The method of claim 26 further comprising the step of
repeating the step of comparing data representing a sequence of
words to data stored in the second memory until a match is found,
and then substituting a token identifying the sequence from the
second memory for the matching sequence in the serial data words.
28. The method of claim 26 wherein the step of comparing
comprises the steps of:
loading data representing a sequence of the words
beginning with the first word into a first register;
loading data stored in a second memory representing a
sequence of words beginning with the first data word into a
second register; and
comparing the contents of the first and second
registers.
29. The method of claim 28 further comprising the step if
the register contents do not match, of repeating the steps of
loading data into the first and second registers and comparing
the register contents until a match is found, and then substitut-
ing a token identifying the sequence from the second memory for
the matching sequence in the serial data words.
- Page 6 of Claims -
94

30. The method of claim 26 wherein the step of comparing
data representing a first word comprises the step of comparing
data representing a first word to representations of words stored
in a first memory to determine locations in the second memory of
stored data representing a sequence of words beginning with the
first data word.
31. The method of claim 26 wherein at least some of the
words comprise representations of language words.
32. A method for compressing a series of words comprising
the steps of:
comparing representations of a sequence of a predefined
number of the series of words with representations of stored word
sequences for a match;
if the match exists, substituting for the matched
sequence of words a token identifying the matched stored word
sequence;
If the match does not exist, eliminating at least one
word from the predetermined number of the series of words thereby
forming a lesser number in the predefined number of the series
of words and repeating the aforementioned steps at least once
using as the representations of a sequence the representations
of the lesser number of the series of words; and
if the match does not exist after the step of repeating
then further repeating selected ones of the aforementioned steps
using as said sequence in the step of comparing the predefined
number of the series of the words including the at least one
word that is eliminated.
33. The method of claim 32 further comprising the step of
repeating said step of repeating the aforementioned steps
eliminating a still further word from the series of words used
in the step of comparing.
- Page 7 of Claims -

34. The method of claim 32 further comprising the steps of
comparing words in the series of words to data representing a
table of tokens for a key data word for a sequence of data words
having the key data word.
35. A method for compressing a series of words comprising:
tokenizing each of at least a portion of the words;
comparing the tokenized portion of the words to a
stored group of tokens for a match; and
if there is a match, substituting a group token
identifying the matching stored group of tokens for the tokenized
portion of the words.
36. The method of claim 35 further comprising the steps of:
identifying a first parameter for one of the words;
identifying a second parameter for the same said one
of the words;
using the first and second parameters for said one of
the words together to identify a memory field within a memory
having multiple fields;
comparing words stored in the identified memory field
to said one word for a match; and
substituting a token for the matched word identifying
the matched stored word.
37. The method of claim 35 further comprising the steps of:
identifying a keytoken in the tokenized portion of
words;
comparing the identified keytoken with individual
keytokens in a stored list of keytokens; and
if a match occurs enabling the step of comparing.
38. The method of claim 37 comprising the step of using the
matched keytokens in the stored list of keytokens for determining
a pointer to a selected group of tokens, from a plurality of said
stored groups of tokens, for use in said step of comparing.
- Page 8 of Claims -
96

39. The method of claim 37 wherein the step of comparing
words stored in the identified memory field comprises:
loading a first tokenized portion of words having the
identified keytoken into a first register;
loading a first stored group of tokens having the
identified keytoken into a second register;
comparing the contents of the first and second
registers;
if the contents do not match, loading additional stored
groups of tokens having the identified keytoken into the second
register and successively comparing the contents of the first
and second registers until a stored group of tokens, which
matches the tokenized portion of words, is found.
40. The method of claim 35 wherein there is an initial
number of tokens in said group of tokens and comprising the step
of repeating the step of comparing, if there is no match, using
as said group of tokens a lesser number of said tokens than said
initial number.
41. The method of claim 39 wherein the step of comparing
words stored in the identified memory field comprises, after
comparing the first and second register, the steps of:
if no match is found, loading a second portion of words
having the identified keytoken and a lesser number of tokens into
the first registers; and
loading stored token phrases having the identified
keytoken into the second register and successively comparing the
contents of the first and second registers until a stored token
phrase which matches the second token phrase is found.
42. The method of claim 35 wherein the step of comparing
the tokenized portion of words comprises:
identifying the first token in a sequence of tokens as
a keytoken;
- Page 9 of Claims -
97

consulting a keytoken list to determine whether a
phrase token is available for phrases beginning with the
keytoken; and
if a phrase token is available, comparing token phrases
beginning with the identified keytoken only to stored token
phrases which begin with the identified keytoken.
43. The method of claim 42 wherein the step of comparing
groups of tokens beginning with the identified keytoken com-
prises:
loading a tokenized portion of words beginning with the
identified keytoken, said tokenized portion having a predeter-
mined number of tokens, into a first register;
loading a first stored group of tokens having the
identified keytoken into a second register;
comparing the contents of the first and second
registers;
if the contents do not match, loading additional stored
groups of tokens having the identified keytoken into the second
register and successively comparing the contents of the first and
second registers until a stored group of tokens, which matches
the tokenized portion of words, is found;
if no match is found, loading a portion of said
tokenized portion of words having a number of tokens lesser than
the predetermined number into the first register;
loading stored groups of tokens having the identified
keytoken into the second register and successively comparing the
contents of the first and second registers until a stored group
of tokens, which matches the portion of the tokenized portion of
words, is found.
44. The method of claim 35 wherein at least some of the
words represent language words.
45. The method of claim 35 wherein the tokenized portion
of words represents a sequence of language words.
- Page 10 of Claims -
98

46. The method of claim 35 wherein the at least a portion
of the group tokens represent sequences of language words.
47. A compressor for a sequence of trial words comprising:
a word dictionary for storing representations of words,
the representations being accessible according to a first and a
second parameter for each word;
means for determining the first and the second
parameters for each of the trial words;
means for accessing for each of the trial words, at
least one word in the dictionary which has the first and second
parameters thereof equivalent to that of the trial word; a
comparator for comparing each trial word and the at least one
accessed word with equivalent parameters for a match; and
a token generator for generating a unique token
identifying the accessed word which matches the trial word and
for substituting the token for the trial word in the sequence of
data words.
48. The compressor of claim 47 wherein the trial words and
the accessed words are of varying length and wherein the
comparator comprises means for comparing the accessed words with
the trial words for the match in a sequence of subparts where the
subparts of the accessed words are at least as long as the
subparts of the trial words undergoing comparison for a match.
49. The compressor of claim 48 comprising a trial register
for storing the longest of said variable length trial words while
such trial word is compared by the comparator.
50. The compressor of claim 47 wherein the trial words
comprise language words.
- Page 11 of Claims -
99

51. The compressor of claim 50 wherein the language words
comprise characters and the first parameter comprises the number
of characters in the corresponding language word.
52. The compressor of claim 51 wherein the second parameter
comprises the character occurring at a predetermined position in
the corresponding word.
53. A compressor for a sequence of words wherein the words
are of variable length comprising:
a word dictionary for storing words in association with
a token identifying each word, the dictionary storing the words
in grammar byte increments, each grammar byte corresponding to
a predetermined amount of data, the words being of variable
length so that at least some words are stored in one grammar
byte, while other words are stored in at least two grammar bytes;
means for determining the number of grammar bytes
required to store a word in the sequence which is to be com-
pressed;
compare logic for successively comparing the words in
the sequence to be compressed to grammar bytes retrieved from the
dictionary for a match;
means for determining the token corresponding to the
appropriate dictionary word for each dictionary word for which
the grammar bytes of the dictionary word match the grammar bytes
of the word to be compressed; and
means for substituting the token corresponding to the
matching dictionary word into the sequence of data words.
54. The compressor of claim 53 further comprising a trial
register in communication with the compare logic for storing a
trial word from the sequence of words to be compressed for
comparison by the compare logic.
55. The compressor of claim 54 further comprising means for
enabling the next word in the sequence to be compressed to be
- Page 12 of Claims -
100

loaded into the trial register after the means for determining
has determined a token for the trial word.
56. Tokenizer for a sequence of words wherein the words
vary in length comprising:
a dictionary for storing a series of words of varying
lengths arranged in varying numbers of grammar bytes, the grammar
bytes having a predetermined length;
means for determining the number of grammar bytes
corresponding to a word from the sequence of words to be
tokenized based upon the length of the word to be tokenized;
compare logic for comparing portions of dictionary
words, each portion corresponding to one grammar byte, portions
of words in the sequence to be compressed each portion correspon-
ding to one grammar byte;
a counter for counting the number of grammar bytes of
each dictionary word which are compared by the compare logic; and
means responsive to the counter for enabling the next
word in the series in the dictionary to be compared by the
compare logic.
57. The tokenizer of claim 56 further comprising a segment
register, in communication with the segment counter and the means
for determining the number of grammar bytes, for storing the
number of grammar bytes determined and supplying the number to
the segment counter.
58. The tokenizer of claim 56 further comprising means for
accessing data words in the dictionary, one grammar byte at a
time and supplying the data words to the compare logic, one
grammar byte at a time, the means for accessing being in
communication with the means for limiting so that no more than
the appropriate number of grammar bytes for each dictionary word
is supplied to the compare logic.
- Page 13 of Claims -
101

59. The tokenizer of claim 58 further comprising a trial
word register for storing a word to be compressed and providing
access to the portions of the word to the compare logic for
comparison.
60. A tokenizer for a sequence of words comprising:
an addressable dictionary for storing words grouped
according to common values for a parameter of the words, each
group of dictionary words having a continuous range of addresses;
means for determining the parameter value for a trial
word to be tokenized in the sequence of words;
a dictionary map indicating the starting address and
address range for dictionary word groups having an identified
parameter value;
compare logic for comparing dictionary words to the
trial word;
means for accessing dictionary words and supplying them
to the compare logic, the means being in communication with the
means for determining the dictionary map for accessing the
dictionary beginning with the starting address for a group of
dictionary words having a parameter value equivalent to the
parameter value of the trial word;
means in communication with the dictionary map, for
restricting the dictionary words accessed by the means for
accessing to those within the address range indicated by the
dictionary map for the appropriate dictionary word group; and
means for substituting a token representing a diction-
ary address identifying a dictionary word in place of each word
to be tokenized which matches a dictionary word as determined by
the compare logic.
61. The tokenizer of claim 60 wherein the means for
restricting comprises a counter in communication with the
dictionary map, for counting the number of dictionary words
accessed and generating a signal when a number of dictionary
- Page 14 of Claims -
102

words corresponding to the address range indicated by the
dictionary map has been accessed.
62. The tokenizer of claim 61 further comprising a trial
word register for storing a trial word and providing access to
the word to the compare logic for comparisons.
103
- Page 15 of Claims -

Description

Note: Descriptions are shown in the official language in which they were submitted.


~3~'5~
D144 :17062 :JAH -1-
SYMBOLIC TOKENIZER_FOR WORDS AND PHRASES
TECHNICAL FIELp
This invention relates to microprocessors and more
particularly to a custom microprocessor which permits a
conventional microprocessor to manipulate quickly and
efficiently entire language words and phrases as tokens
as opposed to manipulating single characters.
!
BACKGRUUND AR~
Conventional microprocessors process text by manipu-
lating binary values which represent single characters.
In order to process an entire word the conventional micro-
processor must se~uentially process each individual char-
acter of the word.
This invention is a custom microprocessor called the
symbolic tokenizer which permits a conventional micropro-
cessor to process language words and phrases in their
entirety instead of on a character-by-character basis.
This leads to greatly increased speed and efficiency where
the manipulation of text is involved,
A data compression method called toXenization is the
basis of this invention. Language words and phrases are
converted into tokens which utilize fewer bits than the
ASCII or EBC'DIC binary equivalents of the same words.
.
Q ~
' ' . .

~2~
1 The following United States patent documents were
considered in the investigation and evaluation o~ th.is
invention:
2,709,199 3,309,694 3,388,3~0 3,599,205
3,685,033 3,810,154 4,122,299 4,229,817
Patent 2,709,199 entitled "CODE SIGNAL CONVERTER~ is
an apparatus for converting signals o~ a code having a
given number of units into signals o~ a code employing a
different number of units. An example would be converting
signals from a five- to an eight-unik code~
This invention does not employ data compression me-
~ thods. It actually increases the amount of code being
; 15 transmitted and is merely used to change from one code
format to another, longer code ~ormat ~or use in telegra-
phy.
Patent 3,309,694 entitled "CODE CONVERSION APPARATUS"
is an electronic translator for converting a binary code
r~presentation of a letter, number, or phrase to an audible
and/or visible indication of that letter, number, or
phrase. This is not a system of text rompression.
Patent 3,388,380 entitled '~SYSTEM FOR DISPLAY OF A
;WORD DESCRIPTION OF PARAMETERS AND VALUES THEREOF IN RE-
SPONSE TO AN INPUT OF A WORD DESCRIPTION OF THE PARAMETER"
is a data display system~ It does not involve the compres-
sion of text.
Patent 3,599,205 entitled "BINARY TO TERNARY P~OTECTED
CODE CONVE~TER" is an apparatus for converting nonprotected
code signals having a specified number of bits to protected
code having approximately the same number o~ bits. This
system does not attempt to compress the code.
Patent 3,685,033 entitled "BLOCK ENCODING FOR MAGNETIC
RECORDING SYSTE~S" is a high density recording and repro-
ducing system. It is a method for increasing the number

~L~S'..~
--3--
o~ stored blts per transition on a maynetic recording medium.This is not a means of text compression.
Patent 3,810,154 entitled "DIGITAL CODE TRANSLATOR" is
an apparatus ~or substantially increasing the amount of informa-
tion that can be transferred by a teletype in a given amount oftime. This method does not require an increase in the bandwidth
of the carrier. The transmitting station converts parallel,
fixed-place, digital-coded message characters into serial,
variable-place, digital-coded messclge characters. The receiviny
station reverses the conversion. This is not a method of data
compression using tokenization.
Patent 4,122,299 entitled "DATA OUTPUT MODIFYING
SYSTEM" is a system for placing data in a format for acceptance
by a general purpose communications printer. The data is
lS originally in a format for providing a television display to news
wire service subscribers. This invention involves the conversion
of data from one format to another. It does not attempt to
compress text.
Patent No. 4,229,817 entitled "PORTABLE ELECTRONIC
CRYPTOGRAPHIC DEVICE" is a hand-held apparatus for enciphering
and deciphering text. This device merely changes the presenta-
tion of data and does not compress it.
None of the above-cited patents concern data compres-
sion techniques utilizing tokenization.
DISCLOSURE OF THE INVENTION
In accordance with a first aspect the invention
provides a method of tokenization of trial words using a stored
dictionary of words. The stored words each have locations
relative to the others in the dictionary, the words in the
dictionary being grouped into segments with words of the same
length grouped in a common segment and words of different lengths
grouped in different segments. The words in each segment having
a plurality of portions make up the complete word. In accordance
with the method, a segment is selected in the stored dictionary.
'
:
'
, .

~2~ g
--4--
The words of the selected segment have a length with a predeter-
mined relation to the length of a given trial word which is
undergoing tokenization. Bytes of one of the words in the
selected segment are then compared with bytes of the trial word
for a match. A token is then formed indicative of the relative
location in the dictionary of a word which is compared and found
to have the match with the trial word.
The invention also provides a method for compressing
a series of words. First and second parameters are identified
for one of the words. Those two parameters are then used
together to identify a memory field within a memory having
multiple fields which contain stored words. The stored words in
the identified memory field are then compared to the one word
aforesaid for a match. A token which identifies the matched
stored word is then substituted for the matched word.
In accordance with a further aspect the invention
provides an alternative method for compressing a series of words
in which data representing a first word is compared to represen-
tations of words stored in a first memory to determine whether
a token exists for a sequence of the words beginning with the
first data word. If a token does exist then data representing
a sequence of the words beginning with the first word is compared
to data representing a sequence of words beginning with the first
data word for a match. If the match exists then a token
identifying the sequence of words is substituted for the matching
sequence of words in the serial words.
The invention further provides a compressor for a
sequence of trial words. The compressor incorporates a word
dictionary for storing representations of words, the representa-
tions being accessible according to a first and second parameterfor each word. Means are provided for determining the flrst and
second parameters for each of the trial words. Means are also
provided for accessing, for each of the trial words, at least one
word in the dictionary which has the first and second parameters
thereof equivalent to that of the trial word. A comparator is
provided for comparing each trial word and the accessed word with

-5-
equivalent parameters for a match. ~ token yenerator is also
prov:ided for generating a unique token identifying the accessed
word which matches the trial word and for substituting the token
for the trial word in the sequence of data words.
In accordance with a stil:L further aspect the invention
provides a compressor for a sequence of words wherein the words
are of variable length. A word dictionary is provided for
storing words in association with a token identifying each word.
The dictionary stores the words in grammar byte increments, each
grammar byte corresponding to a precletermined amount of data, the
words being of variable length so that at least some words are
stored in one grammar byte, while other words are stored in at
least two grammar bytes. Means are provided for determining the
number of grammar bytes required to store a word in the sequence
which is to be compressed. Compare logic is provided for
successively comparing the words in the sequence to be compressed
to grammar bytes retrieved from the dictionary for a match.
Means are also provided for determining the token corresponding
to the appropriate dictionary word for each dictionary word for
which the grammar bytes of the dictionary word match the grammar
bytes of the word to be compressed. Means are further provided
for substituting the token corresponding to the matching
dictionary word into the sequence of data words.
In accordance wi~h yet another aspect, the invention
provides a tokenizer for a sequence of words wherein the words
vary in length. The tokenizer incorporates a dictionary for
storing a series of words of varying lengths arranging in varying
numbers of grammar bytes, the grammar bytes having a predeter-
m~ned length. Means are provided for determining the number of
grammar bytes corresponding to a word from the sequence of words
to be tokenized based upon the length of the word to be token-
ized. Compare logic is provided for comparing portions of
dictionary words, each portion corresponding to one grammar byte,
portions of words in the sequence to be compressed each portion
corresponding to one grammar byte. A counter is provided for
counting the number of grammar bytes of each dictionary word

~ ~ 5 ..~
-5A-
which are compared by the compare logic. Means responsive to the
counter are provided for enabling the next word in the series in
the dictionary to be compared by the compare logic.
The invention also provides a tokenizer for a sequence
of words. In this aspect, an addressable dictionary is provided
for storing words grouped according to common values for a
parameter of the words, each group of dictionary words having a
contiguous range of addresses. Means are provided for determin-
ing the parameter value for a tria:L word to be tokenized in the
sequence of word,s. A dictionary map is provided indicating the
starting address and address rangs for dictionary word groups
having an identified parameter value. Compare logic is provided
for comparing dictionary words to the trial word. Means are
provided for accessing dictionary words and supplying them to the
compare logic, the means being in communication with the means
for determining and the dictionary map for accessing the
dictionary beginning with the starting address for a group of
dictionary words having a parameter value equivalent to the
parameter value of the trial word. Means are also provided in
communication with the dictionary map, for restricting the
dictionary words accessed by the means for accessing to those
within the address range indicated by the dictionary map for the
appropriate dictionary word group. Means are also provided for
substituting a token representing a dictionary address identify-
ing a dictionary word in place of each word to be tokenized whichmatches a dictionary word as determined by the compare logic.

g
BRIEF DESCRIPTIOlN OF THE DRAWINGS
FIG. 1 is a block diagram of the symbolic tokenizer
showing a dictionary ROM and a control microprocessor em-
bodying the present invention;
FIG. lA is a schematic and block diagram depicting a
storage array in the ~ontrol microprocessor containing
data for operating the symbo:Lic tokenizer;
FIG. 2 is a schematic and block diagram of the dic~
tionary portion of th symbolic tokenizer and the control
logic for the dictionary;
FIG. 3 is a schematic and block di.agram of the com-
parator portion of the symbolic tokenizer and showing a
mailbus and output multiplexer;
FIG. 4 is a schematic and block diagram of a control
logic portion of the symbolic tokenizer;
FIG. 5 is a schematic of an interface to a control
microprocessor, such as a Commodore 64K computer;
FIG. 6 is a schematic and ~lock diagram of a dictio-
nary ROM text storage arrangement allowing parallel char-
acter processing;
FIG. 7 is a schematic and block diagram depicting aflowchart of the process of controlling the symbolic tokan
i2er for tokenization;
FIGS. 8A-8Pare schematic and block diagrams depicting
a flowchart of the process of tokenization;
FIG. 9 is a schematic and block diagram depicting a
flowchart of the process of detokenization.
FIGo 10 is a schematic depiction of a keytoken list
stored in ths dictionary ROM:
FIG. 11 is a schematic and a depiction o~ a phrase
ta~le stored in the dictionary ROM:
FI~. 1;2 i5 a schematic and a depiction of a memory
map stored in the control microprocessor and used to access
the phrase t:able;

~srs~
1 FIG. 13 is a schematic: and a depictin of a memory
map stored in the control microprocess and used to access
the keytoken list; and
FlGS. 14 and 14A are a schematic and block diagram
depicting a flowchart of the process o~ phra~e token$za-
tion.
:
' 10
'
3S

~:s~
MODES FOR CARRYING OUT THE INVENTION
, . , ~
As clepictecl in FIG. L, a conventional microprocessor
100 is used to control the symbolic tokenizer 102A. The
control microprocessor may be a CommodoreTM 64K computer,
an IBMrM Personal Computer, or any other type of proces-
sor. The control microprocessor includes a storage area,
for example, ROM 100A. However, it is to be understood
that the storage area to be used in the manner described
below may a:Lso be a removable storage medi~lm, such as a
floppy diskette.
The storage area contains a map of the start and end
points of areas to be searched in the dictionary. In the
preferred embodiment, the map is a memory map similar to
that depicted in FIG. lA and is accessed according to the
starting character of the English-language trial word to
be tokenized and the number of characters in that word.
For any given starting character and character length,
there is both a unique starting address, alpha, corre-
sponding to an address of a memory location in the dic-
tionary ROM 104 and a representation of the total numberof grammar bytes, beta, having the given starting charac-
ter and the given character length. The term grammar byte
will be defined below with respect to FIG. 6 after the
dictionary architecture is defined. The starting address
points to a location in the dictionary ROM for starting a
search for words to be compared to the particular word un-
der consideration. The number of grammar bytes indicates
the total number of dictionary look-ups required to com-
pare the trial word with all the words in the dictionary
ROM containing the given starting character and the given
number of characters in the trial word. Therefore, the
number of grammar bytes is used to determine the end point
for searching.
Prior to using the symbolic tokenizer, a dictionary
of preselected, commonly-used words is stored in ROM 104.
.. ~
~ .

~zs~9
9 -
1 The symbolic tokenizer's dictionary ROM 104 stores words
according to starting letter and character string length.
The dictionary effectively contains segm2nts which depend
upon the number o~ characters in the words they contain.
In the preferred embodiment, the first segment contains
words having between one and four characters, the second
between five and eight, the third between nine and twelve,
and the fourth between 13 and 16. The reason for the
divisions into segments is that the language words in any
given segment contain the same number of grammar bytes.
The use of four dictionary segments in the prototype
symbolic tokenizer allows for language words o~ 16 charac-
ters in length, or tokenized phrases of 80 bits in length
in the preferred embodiment. Language words of twenty
characters, or phrases of 100 bits, can be stored with an
appropriate change in the memory map of FIG. lA. Addition
ally, a different trial word register would be required
because the registex of the present embodiment has a phys-
ical capacity of only 16 characters or 80 bits.
Segment codes are used to indicate how many dictionary
reads must occur before th~ results of the comparison
should be sampled. Only four codes, Sl, S2, S3, and S4,
are required because, for example, for the character code
Sl, one grammar byte is needed to read a single character
word, a two-character word, a three character word, and a
four-character word. Therefore, only one clock cycle or
grammar byte is required to read the entire word for the
character code S2, two grammar bytes are needQd to read a
5-character word, and 6-, 7-, or 8-character words. Each
read during a given clock cycle transfer6 a maximum of
four characters from the dictionary ROM to the compare
logic 1~3.
Each dictionary segment contains up to four subseg-
ments and up to 104 sectors. There is one subsegment for
each of the 4-character string lengths in the se~ment.

--10--
1 Ther~ is one sector ~or ea~_h different starting letter
and each unique character string length of the language
words within that dictionary segmenk. For examplel if a
particular dictionary subse~nent contained language words
S of two characters starting with the letters A through Y,
but no language words starting with Z, then that dictionary
subsegment would contain 25 sectors of words. That is,
one sector of words for each unique starting character
and character string length.
10Division of the memory map into both segments and
sectors permits two-dimensional accessing of the memory
map data ~or obtaining the dictionary starting addr~ss
for the search. Therefore, no searching of the memory
map is re~uired, only direct look-up, and only a minimum
f searching of the dictionary is necessary to determine
whether the trial word is present in the dictionary.
Only that sect~r of the stored dictionary where words of
; the appropriate starting letter and character length are
stored is searched. This two-dimensional accessing vf
the memory map permits an extremely fa~t search o~ the
stored dictionary because only a small portion of the
stored dictionary is actually searched.
The symbolic tokenizer compares a trial word to a
language word within the stored dictionary and returns
the address in the dictionary ROM where that word may be
~ound. The address is then provided to the control micro-
processor where the token is derived. In the preferred
embodiment, the token is the address of that word within
the dictionary, divided by the number of the dictionary
segmer,t within which the word was found, and added to an
offset. The offset is determined by the dictionary segment
and`depends upon the location of ~he segment within the
dictionary.
This is done, instead of simply using the unaltersd
3S dictionary address as a token, because the division re~ults

~:s~
1 in a binary number one digit less in length than the ori-
ginal number. The addition of the o~fset restsres in~orma-
tion lost in the division ~o that each token ~till defines
a unique address within the dictionary.
During detokenization, the reverse process takes
place. The control microprocessor knows from which dic-
tionary segment the token was generated, because tokens
generated from given segments fall within ~iven numeric
ranges. Tokens numerically greater than one given number,
but less than another given number, must have their corre-
sponding language word with.in a corresponding dictionary
segment. That is, each dictionary seyment has tokens that
fall between predetermined numbers.
Since the dictionary segment can be determined from
the token itsel~, the offset can be subtracted from the
- token and the token then multiplied by the appropriate
dictionary segment number to find the address o~ the lan-
guag~ word within the dictionary ROM. During detokeniza
tion, the ROM address is supplied to the symbolic tokenizer
by the control microprocessor, and the symbolic tokenizer
returns the language word represented by khe token to the
control microprocessor.
As shown in FIG. 6, language words are repre~ented
in the dictionary by one or more 20-bit bytes called gram-
mar bytes. Each 20-bit byte represents four characters
from the language word. Language words may be represented
by up to four 20-bit bytes in the dictionary because the
- present embodiment contemplates storing words of up to 16
characters. The use of up to four 20-bit grammar bytes
provides a variable-byte word length of 20 bits each to
maximize dictionary storage e~ficiency. For example, the
word "water" is stored in two 20-bit grammar bytes in
ROMs 1~5, four characters in the first byt~ and one char-
~; acter in the second. Additionally, as indicated in FIG.
2 of the preferred embodiment, ten dictionary ROMs 1-10

3~ZS5,8~
1 are separated into two banXs of five ROMs each, ROMs 1-5
and ROMs 6-10. Hence~orth, ROMs 1-5 will be considered,
but all the grammar bytes for any given word are stored
in the same bank o~ ROMs, and it will be understood that
ROMs 6-10 can be treated in an e~uivalent manner. Fox
the word ~Iwater~ the last four charactexs are stored in
one 20-bit grammar byte of ROMs 1-5, as indicated in FIG.
6. The least si~ni~icant ~our bits o~ the letter "r" are
stored in the least signi~icant four bits o~ ROM 1. The
most significant bit of the letter "r" is stored in the
least significant bit of first ROM 2. The least signi~i-
~ cant three bits of the letter "e" are stored in the remain-
-; ing least ~ignificant three bits of the first four hits
of ROM 2, and the most significant two bits of the letter
"e" are stored in the least two significant bits of the
first our bits of ROM 3. The two least significant bits
o~ the letter "t" ar~ stored in the two most significant
bits of the first four bits of ROM 3~ and the three most
significant bits of the letter "t7' are ~tored in the three
least significant bits of the first Pour bits of ROM 4.
The least significant bit of the letker '1a" is stored in
the most significant bit of the first four bits of ROM 4,
and the four most significant bits o~ the letter "a" are
stored in the first four bits of ROM 5. Because fiYe
ROMs are used in a bank, and only ~our bits from each ROM
are used for any given word, one character, or five bits,
from the word "water'l still remains to be stored. A sec-
ond 20-bit byte is then used to store the last character~
It should be noted that whether the least siynificant or
most significant bits of the ROMs are used depends o~ly
~ on the address used to access the ROMs and will vary ac-
; cording to how the actual words are stored~
: As is apparent ~rom FIG. 6, each of the chara~ters
"a'l, "t", "e", and "r" are stored in the same memory loca
; 35 tion having the same address in each of ROMs 1~5. There-
' .

~Z5~
-13-
1 fore, all the letters in a s~rammar byte may be processed
in parallel for any given address or memory location in
the bank of RO~s. In other words, the letters "a", "t",
"e", and "r" can be accessed simultaneously or in parallel.
As is well known in the art, standard ROMs have eight
pins for accessing the memory locations. However, the
pref~rred embodiment provides access to only four pins,
or bits, of memory locations per ROM at any one kime and
provides five ROMs in a bank so that an integer number
~i.e., 4~ of characters may be represented in the bank of
ROMs without repeatedly leaving blanks or null characters.
The other four bits, or memory locations, for a given
address in ROMs 1-5 are used ~or another grammar byte,
but only in conjunction with each other. H~reafter, only
one set of ~our bits of the eight bits available in each
of ROMs 1-5 will be discussed. It will be assumed that
other grammar bytes are allocated among each of the remain-
ing four bits in ROMs 1-5.
It is undesirable to use all eight bits from a bank
of only three ROMs, for example, all eight bits of each
of ROMs 1 and 2, and only four bits of the eight bits of
ROM 3 to form a 20-bit grammar byte. ~he remaining four
bits in ROM 3 would not be used. Similarly, the same
arrangement would not be used with a bank of ~ive ROMs,
Z5 since the remainder of ROM 3 and all o~ ROMs 4 and 5 may
contain a 20-bit grammar byte of an entirely di~ferent
word than the 20-bit grammar byte of ROMs 1 and 2 and the
first half of ~OM 3. Special multiplexing would be neces-
sary to be able to read the grammar byte from ROMs 1-3
and not the grammar byte in ROMs 3-5. Therefore, using
only four bits of each of five ROM6 is preferred.
Because one wAole character remains to be stored
from the word "water", the next succeeding memory location
in each of` ROMs 1-5, defined by the next succeeding ad-
dress, is used to store the character "w". In this regard,

~55~
--14--
1 the least significant four bits of the character "w" arestored in the least signific:ant four bits o~ ROM 1, and
the most significan~ bit of the character "w" i5 stored
in the least significant bit of the Iirst four bits of
5 ROM 2. Zeros or null characters are placed in the three
most significant bits of the four bits used in ROM 2, and
in khe least significant four bits used in each of ROMs
3, 4, and 5.
Words containing only Pour characters or less would
10 retluirs an appropriate numl: er OI bits of only one grammar
byte in ROMs 1-5. Words containing between five and eight
characters would require all 20 bits OI a îirst grammar
byte and an appropriate number of the 2 0 bits of a second
grammar byte. For an 8-character word, two 20-bit grammar
15 bytes would be required. For words containing between 9
` and 12 characters, a series of three 2 O-bilt gralT mar bytes
would be required, all of the bits in the ~irst two grammar
bytes beint3 used while an appropriate number of bits in the
last 20-bit gra~unar byte are taken up by the ret~uired
20 bits for tha appropriate number of characters remaining.
The same comments apply with respect to words containing
b~tween 13 and 16 characters. Additionally, it would be
a short extension to include words having between 17 and
20 characters by allowing up to five 20-bit grammar hytes
25 per Word.
The storage of the bits in the various ROMs is con-
trolled during creation of the dictionary in a manner
that would be obvious to one skilled in tha art in view
of the description herein. The memory map in the control
30 microprocessor is then created based on the particular
physical location oP the grammar l~ytes in the respective
bank of ROMs. Access to the ROMs for searching for dict-
ionary words and f or reading the contents of the various
memory loca tions is as described below with respect to
35 FIG. 2 .

5~9
-15~
1 Phrases up to 80 bits long are also stored in the
dictionary. The phrases are stored as previously tokenized
words;therefore, several words may comprise a phrase which
is stored in four 20-bit bytas. That is, since the phrases
are stored as tokens of, at most, 20 bits ea~h, and not
as language words, their storage requires much less memory
than i~ the phrases were stored as their ASCII or EBCDIC
equivalents. Depending on whether a token is used, or
merely the origlnal address, the "token" is typically
lo between 14 and 16 bits long.
The character len~th distribution o~ language words
varies depending upon a particular given dictionary's
vocabulary. For exampl~, a dictionary may be designed
around a standard business vocabulary. The business vocab-
ulary has a character population density that places about90% of its words in the 5- to 12-character length category~
The following is a segment map that represents a
business dictionary. Other vocabularies may be divided
di~erently. The segment division points are software
programmable functions of vocabulary and are defined wlthin
the control microproaessor's tokenizer dictionary map.
The prototype s~mbolic tokenizer~s ~usiness dictionary
; is composed of 13,312 language words. The business dic-
tionary is divided into four segments designa ed as Sl,
S2, S3, and s4.
Segment Sl contains approximately one thousand words
and is addressed by the hexadecimal numb~rs 0000 through
03FF o
Segment S2 contains approximately six thousand words
and is addressed by the hexadecimal numbers 0400 through
lBFF.
Segmenl: S3 contains approximately five thousand words
and is addr~essed by the hexadecima~ numbers lC00 through
2FFF.
Segment: S4 contains approximately one thousand words

3l2~i51~
-16-
1 and is addressed by tha hexadecimal numbers 3000 through
33FF.
Sl expresses words reguiring one 20-bit grammar byke
(up to four characters). S2 expresses words requiring
two 20-bit grammar bytes Ol- 40 bits (between five and
eight characters). S3 expresses words requiring three
20-bit grammar bytes or 60 bits (between nine and twelve
characters). S4 expresses words requiring ~our 20-bit
grammar bytes vr 80 bits (between thirteen and sixteen
characters).
The remainder of the tokenizer 102A will now be de-
scribed. The compare logic 103 compares the trial word
from the trial word register 111 to selected individual
words from the proper segment of the dictionary ROM 104.
Address counter 105 determines each address o~ the dictio-
nary ROM 104 from which a word, or a portion of a word is
to be taken to be compared by the compare logic 103.
Words are transferred from the dictionary ROM 104 to the
compare logic 103 over the grammar bus 112.
The address counter 105 is a register/counter and
holds the memory address ~or ROM 104 indicating the grammar
byte to be accessed. After a first grammar byte is read,
the address counter is incremente~ so that the grammar
byte in the next succeeding address can be read.
The segment register 106 determines how many dictio-
nary reads must take place to determine when the output
of the compare logic 103 should be sampled. This is ne-
cessary because words stored in the dictionary ROM 104
are variable in length and require from one to four reads
to be output to the compare logic 103.
A segm!ent counter 107 counts down while th,e address
counter 105 counts up. The segment counter 107 counts the
number of grammar bytes required to read a gi~en language
word from the selected segment of the dictionary ~OM 104
being searched. During the read of dictionary ROM 104,
.

~2~309
17-
1 the first yrammar byte will be read from the memory loca-
tions of ROMs 1-5. At the end of the read, during the
first clock cycle, the segment:counter will be decremented,
and the address counter will be incremented. Each time
the segment counter 107 reaches zero (i.e., when the entire
word has been read from the dictionary ROM), the segment
counter 107 is reloaded from the segment register 106 so
the next word can be read.
The control logic 109 stops the address counter 105
from incrementing when either a comparison by the compare
logic 103 is true, or the en~ire sector of the dictionary
ROM 104 which was to be searched had been searched, and no
comparison by the compare logic 103 was found to be true.
The csntrol logic 109 also accepts tokenizer control codes,
shown as TCC in FI5. 1, from the control microprocessor
100 through the mailbus 102 to be described more fully
below.
A brief description of the process will now be given
with respect to FIGS. 1 and 7.
As shown in FIG. 1, a control microprocessor 100 ~a
Commodore 64K in the prototype) will have a memory map
for the dictionary ROM 104 alr~ady created as di~cussed
above, and the dictionary ROM will contain the desired
library of dictionary words. The control micxoprocessor
100 issues a tokenizer reset command through interface
101 to the mailbus 102. The interfac~ 101 i~ described
below with respect to FIG. 5. The reset command is used
to reset the segment register 106, and reinitialize the
segment counter 107. The tokenizer reset also reinitial-
iz~s the address counter 105 and the control logic 109.
The scan limit down counter 108 and ~he status register 110
are also reinitialized.
The control microprocessor then determines the number
of charactexs in the trial word under consideration and
datermines a segment code arcording to the number of char~

~2~ 9
-18-
1 acters in the trial word. As discussed above, lf the
trial word contains between one and Pour characters, a
segment code reprcsenting "SL~ will be issued to the seg-
ment register 106. If the trial word contains between
five and eight characters, a segment code representing
"S2" will be loaded in the segment register. If the word
contains between nine and twelve characters, a representa-
tion o~ the segment code "S3" is loaded, and if the trial
word contains between thirteen and ~ixteen characters, a
lo segment code representing "S4" is loaded. ~or example,
the word "water" uses two grammar bytes, so the segment
counter would contain a representation oP "S2" or "2l',
indicating that two grammar bytes must be read for "water".
The control microprocessor then supplies the trial
- 15 word "water" through the inter~ace 101, and the parallel
mailbus 102, to the trial word register 111. During the
tokenization process, the trial data is the word on which
tokenization is to be attemptad.
The control microprocessor utilizes the character
string length o~ the trial word and the beginning character
of the-trial word to access the dictionary ROM information
contained in the memory map in storage device lOOA. The
microprocessor obtains the starting address and thectotal
number of grammar bytes for the words starting with the
given character and having the given character count. ~s
indicated above, beta indicates the number o~ total grammar
bytes of the words containing the given letter and having
the given character length. Since a single grammar byte
may not contain representations of all characters in the
stored word, several grammar bytes may be required to
form a complete representation o~ the stored word. How-
ever, beta is a representatio~ o~ the total number oP
grammar bytes needed to step through all the entries in
the dictionary ROM containing words starting with the
3S given character and having the given character string

~2~58~)
--19--
1 length. By way of example, the word "water" and all other
five-character words beginning with the charact~r "w" and
stored in the dickionary ROM would require two gram~ar
bytes to store the ~ndividual word. If there were five
s words stored ir. the dictionary ROM, each starting with
the character "w", and each consisting of a total of five
characters, then the total number of grammar bytes, beta,
reguired to search through the dictionary ROM for the
five words would be ten grammar bytes. In other words,
the toXenizer 102A must cycle or search through the dic-
tionary ROMs, starting at the star'ing address, represented
by alpha, twice ~or each word. Carrying out the twice-
; repeated cycle for each stored word requires a total of
ten cycles or look-ups. Therefore, beta, corresponding
to words starting with the character "w" and having five
characters, would contain a representation of "10".
The control microprocessor loads the starting address
into the address counter 105. For example, if the words
starting with "w" and having five characters were found
beginning at tha first address location of ROM 104, the
address counter would be loaded with a representation of
the hexadecimal number 0000. ~he control microprocessor
also loads the scan limit down counter 108 with a repre-
sentation of beta, the total number of 20-bit grammar
bytes t~ be read- from the dictionary ROM 104. In the
present example, beta is 10.
Each of the above steps are each preferably accom
plished during a respective single clock cycle of the
microprocessor. The particular order in which the steps
arP carried out is not significant, as long as the koken-
izer registers are loaded with representations of the
appropriate information. However, as would ~e clear ~o
one skilled in the art, reset of the tokenizer must occur
prior to loading of any registers in the tokenizer. Once
the registers have been loaded, the control microprocessor

~25~g
-20-
1 issues a start-scan command to the tokenizer 102A. Once
the start-scan command i~ issued, the tokenizer steps
through a defined se~uence of steps, and the con~rol micro-
processor is free to carry out other operations. The
control microprocessor intermittently tests the mailbus
102 for one of three status codes, a representation of
DEQ representing data equals, a representation of DNF
meaning data not found, or a representation of SIP indi-
cating scan in progress~ In the prototype, a mailbus
interrupt was possible for the interface 101 to interrupt
the operation of the control microprocessor 100 when a
mailbus interrupt signal is received from the tokenizerO
During the toXenization process, the tokenizer com-
pares the trial word with each word in the dictionary ROM
identified according to the starting address and the number
of grammar bytes. The compare logic 103 will compare the
first grammar byte o~ the trial word contained in register
111 with the first grammar byte read ~rom the first memory
location in dictionary ROM 104. At the same time, the
segment counter will be decremented so that it contains a
representation of "1", indicating that one more grammar
byte must be read in order to read the entire word "water".
Additionally, the address counter will be incremented so
that the counter contains a representation of the hexade-
cimal number 0001. The scan limit down counter 108 is
also decremented so that it contains a representation of
09~O If a match is found in the first grammar byte, no
further change in state of the tokenizer is made. During
the next cl~ck cycle, the next grammar byte is read from
dictionary ROM 104 and compared with the next grammar
byte in trial word register 111. Additionally, the segment
down counter is decremented to a representation of 1l 0 ",
indicating that all grammar bytes for the given word have
been read and compared. At this point, the compare logic
3S is sampled l:o determine the status o~ the compare. Since

~2~i~;8~9
-21-
1 a match would be ~ound in thle present example, the token-
izer issues the DEQ (data equals~ signal, and the control
microprocessor issues a read command to the tokenizer so
that the address of the dictionary word ~or which a match
was found can be read by the control microprocessor. The
address is the~ used by the control microprocessor to
determine an appropriate token. Specifically, the address
is preferably divided by the segment number within which
the word was ~ound and added to an of~set, as described
above. The token is either then stored as a substitute
text storage scheme, or may be communicated to another
processor as part of a text transfer. Detokenization may
then be conducted at any time.
If no match is ~ound after the comparison of the
trial word with the word in the starting address, the
starting address is incremented, the segment down counter
; is reloaded with the representation o~ the segment code,
the scan limit down counter is decremented, and the next
5-charclcter word starting with the character "w" will
then be available to be read from dictionary ROM 104. It
should be noted that if a match is no~ found for the first
grammar byte, the comparison process is still continued
so that the address counter is incremented, and the scan
limit down countex is decremented, as ~ppropriate, so
that the tokenizer can start with the address of ~he next
succeeding word as the comparison process continuss.
Additionally, the continued cycling allows the segment
down counter to be rel~aded with the representation o~
~he segment code so that all of the grammar bytes for the
next succeeding word will be read and compared. If no
match is found a~ter the appropriate number of grammar
bytes have been accessed, the DNF (data not found) signal
is provided to the status register 110, which then produces
a DNF to mailbus 102. This signal is provided to the
control microprocessor which then either terminates the

~2~5~ 9
l text processing or repeats the above-descri~ed process
with a subsequent trial word~
As shown in FIG. 7, il the control microprocessor
identi~ies ~he trial word as having greater than 17 char-
s acters, a data not found code is issued, and the controlmicroprocessor returns to process another word or ends
the text processing. In the situation where the trial
word contains more than 16 characters, the particular
trial word is stored as the ASCII or EBCDIC equivalents.
lo ~owever, this would occur in a very small percentage of
the words~
The tokeni~er and the method and means by which the
control microprocessor directs the tokenizer will now be
considered in detail. The tokenizer control codes are 9-
bit commands from the control microprocessor lO0 _o thes~mbolic tokenizer. The commands are divided between
write csmmands, which cause outputs to be sent from the
control microprocessor lO0 to the symbolic Tokenizer, and
read commands, which cause inputs to be sent ~rom the
symbolic tokenizer to the control microprocessor lO0.
~he commands are as follows:
.

3~,2i~,5~
-23-
1 ~AB];E I
WRITE COMMANDS
MAILBUS BIT HEX
ADDRESS VALUE FUNCTION
5SSSSOOOOO 00 TOKENIZER RESET
SSSSOOOOl 01 TRIAL WORD BYTE lA
SSSSOOO10 02 TRIAL WORD BYTE lB
SSSSOOOll 03 TRIAL WORD BYTE lC
SSSSOO100 04 LOAD SCAN ADDRESS
REGISTER LOW BYTE
SSSSOO101 05 TRIAL WORD BYTE 2A
SSSSOOllO 06 TRIAL WORD BYTE 2B
SSSSOOlll 07 TRIAL WORD BYTE 2C
SSSSO1000 08 LOAD SCAN ADDRESS
REGISTER HIGH BYTE
SSSSO1001 09 TRIAL WORD BYTE 3A
SSSSO1010 ~A TRIAL WORD BYTE 3B
SSSSO1011 OB TRIAL WORD BYTE 3C
SSSSOllOO OC LOAD SCAN ADDRESS
WORD COUNTER LOW BYTE
SSSSOllOl OD TRIAL WORD BYTE 4A
SSSSOlllO OE TRIAL WORD BYTE 4B
SSSSOllll OF ~RIAh WORD BYTE 4C
SSSS10000 10 LOAD 5CAN ADDRESS
z5 WORD COUNTER HIGH BYTE
SSSS10100 14 LOAD SEGMENT REGISTER
SSSSllOOO 18 ST~RT SCAN
:

~L2S~
-24--
TABLE I I
READ C MMANDS
MAILBUS BIT HEX
ADDRESS VALUE FUNCTION
5SSSSXX000 00 TOKEN ADDRESS BITS 0-7
SSSSXX001 01 TOKEN ADDRESS BITS 8-14
SSSSXX010 02 GRAMMAR BIJS BITS 0-7
SSSSXX011 03 GRAMM~ BUS BITS 8-15
SSSSXX100 04 GRA~ BUS BIT5 16-19
10SSSSXX101 05 TOKENIZER STATUS BITS 4-7
'~
;

-25-
1 The first four digits of each mailbus bit address
(SSSS) are the bits which the control microprocessor 100
uses to select with which s~olic tokenizer it will com-
municate when more than one symbolic tokenizer sits across
the mailbus 102. This i~ the address set into the dip
switch 57 and sensed by the address select 53, both of
FIG. 4. The X's in the read commands indicate diyits not
used.
The read and write commands are utilized by a mailbus
lo address llne 116, to be described below with respect to
FIG. 4, to enable the correct logic component to accept
the data being provided by the control microprocessor on
the mailbus data lines 115.
The TOKENIZER RESET write command is sent from the
control microprocessor 100 to the s~mbolic tokenizer to
reset the symbolic tokenizer's logic.
: The TRIA~ WORD BYTE commands are used to load the
trial word into the trial word register 111 in the form
of register files 34-3&. As discussed above, the trial
word transmitted to the symbolic tokenizer may be up to
80 bit~ long. This allows for trial words of up to 16
characters, as twenty bits are required for each ~our
charactersO Transmission of the trial word to the symbolic
- tokenizer is accomplished in up to twelve separate trans-
missions, i.e., a maximum o~ eight 8-bit loads and four
4-bit loads.
The TRIAL WORD BYTE lA through lC write commands
repr sent commands to load the ~irst 20 bits of the trial
: word. TRIAL WORD BYTE 2A th-ough 2C represent commands
to load the second 20 bits of the trial word. TRIAL WORD
BYTE 3A through 3C represent commands to load the third
20 bits of the trial word. TRIAL WORD BYTE 4~ through 4C
represent commands to load the fourth 20 bits of the trial
: word. In each case, a TRIA~ WORD B~TE having an 'A9 suf-
fix, such as TRIAL WORD BYTE 4A, represents a command to

~s~
-26-
1 load bits O through 7, a T.RIAL WORD BYTE having a 'B'
suffix represents a command to load bits 8 through 15,
and a TRIAL WORD BYTE having a 'C' suffix represents a
command to load bits 16 through 19.
LOAD SCAN ADD:E<ESS REGI'STER LOW BYTE and LOAD SCAN
ADDRESS REGISTER HIGH BYTE write commands enable loading
in the address counter of khe 16-bit dictionary address,
where the search is to begin. This is the address o~ the
first word in the dictionary sector where the word on
which toXenization is being attempted will be contained,
if it is in the dictionary. This address is derived ~rom
- alpha in the memory map and is loaded in two 8-bit bytes
to the address counter 105 in the form of token address
counter 16-19.
LOAD SCAN ADDRESS ~ORD COUNTER LOW BYTE and LOAD
SCAN ADDRESS WORD COUNTER HIGH BYTE write rommands enable
loading of the scan limit down counter 108 in the form of
dictionary scan down counter 54-56 of FIG. 4. The dictio-
nary scan down counter 54-56 is loaded with the number o~
grammar bytes in the sector of the dictionary being
searched so that the dictionary scan down counter 54-56 can
: decrement as grammar bytes are compared until all words
~ of that dictionary sector have been compared. SCAN
: ADDRESS WORD COUNTER LOW BYTE allows loading of eight
bits of the count, and LOAD SCAN ADDRESS WORD COUNTER
HIGH BYTE allows loading of the remaining four bits of
the count. As discussed above, the counter prevents dic-
tionary reads beyond the proper dictionary sector. For
example, if there are five 5-l~tt~r words beginning with
30 "w", then the dictionary scan down counter will be loaded
with a representation of 10 (five words times two grammar
bytes per word).
The LOAD SE~MENT REGISTER writa command enables load
ing of the Segment Register 106 with the segment code.
The ST~RT SCAN write command is the command from the
.

-27-
1 control microprocessor loO to the symbolic tokenizer to
begin a s~arch of the dictionary. It iæ issued after the
above output command functions have been issued and the
appropriate logic device set,.
The R~AD commands are address codes utilized in con-
junction with a ~BRW (mail 45 read/write) code from the
control microprocessor to enable acce~s to the contents
of various logic components in the tokeniz~r for r~ading.
The TOI~NIZER ADDRESS BITS 0-7 and the TOKENIZER
ADDR~SS BITS 8-14 read commands enable access by the con-
trol microprocessor to the address counter to read the
address o~ the dictionary word with which a comparison
was found true.
The GRAMMAR BUS BITS 0-7, the GRAMMAR BUS BITS 8-15,
and the GRAMMAR BUS BITS 16-19 read commands load the
contents of the grammar bus 112 into the control micropro-
cessor 100. During detokanization, the grammar bus con-
tains the language word which was represented by the ad-
dress derived from the token.
The token address, shown in FIG. 1 as TKA, is a form
of the token. The token may be the actual memory address
or may be a derivation of the memory address, as described
above. The token address is the address within the dic-
tionary ROM 104 where the word was found. The token ad-
dress is transmitted through the token address bus 113
and the mailbus 102 to the co~trol microprocessor 100
when the compare logic 103 is true.
; Detokeni~ation, or the converting of tokens back
into language words, is achieved by presenting the toXen
as a dictionary address on the tok~n address bus 113
through the address counter 105. During detokenization,
the address counter 105 functions as a simple address
register, instead of as a counter.
Ihe data found at the addressed ROM 104 location is
output to the grammar bus 112. This data is the language

-2~-
1 word which was represented by the token on which detokeni-
zation is taking place.
During detokenization, the language word 6tored in the
dictionary ROM 104 at the address represented by the token
is transferred through the grammar bus 112 and the mailbus
102 to the control microprocessor 100.
The TOKENI2ER STATUS BITS 4-7 read command requests
the s~mbolic tokenizer to send its status to the control
microprocessor 100. The symbolic tokenizer status may be
Data Equals (DEQ), Data Not Found (DNF), or Scan In Prog-
ress (SIP). Data Equals is represented by the binary
code 001. Data Not Found i5 represented by the binary
code 010. Scan In Progress is represented by the binary
code 100.
Scan In Progress indicates to the control micropro-
cessor 100 that a dictionary search is currently in pro-
gress. Data Equals indicates that a ¢omparison was ~ound
to be true, and Data Not Found indicates that no comparison
was found to be true for the trial word on which tokeniza-
; 20 tion is being attempted.
The componenks of the tokenizer will now be described.
As shown in FIG. 2, dictionary ROMs 1 through 10 store
the dictionary of words which may be tokeniæed. The dic-
tionary of the prototype unit utilizes ten 2764 EPROMs.
EPROMs are used in the prototypa unit to allow for ease
of changing thair content as the tokenization of dif~erent
vocabularies is studied. ROMs will be used in production
units because of their low cost The ROMs are 8-bit
devices, each having a capacity of eight kilobytes. Toge~
ther they are used to generate 32 kilobytes of 20-bit
words.
As discussed above, the tokenizer manipulates 20-bit
grammar bytes. The grammar bytes are gensrated by taking
five of the ten ROMs at a time and utilizing four of the
eight bits from each ROM to obtain a 20-bit-grammar byte.

3L25~
-29-
1 The ROMs are accessed using the dictionary address
through token address counter 16-19 described below. The
least significant bit of the dictionary address causes
multiplexing between the least ~ignificant and the most
significant, four biks of each Dictionary ROM 1 through
10. In other words, i~ the address i8 even, for example,
the first four bits of each ROM in the bank of ROMs will
be accessed, whereas if the address is odd, only the last
four bits of the eighk bits of each ROM will be accessed.
lo This obtains the full 32-kilobyte address range of the
ROMs and facilitates parallel processing as discussed
above. The dictionary ~OMs 1 through 10 are addressed by
the token address bus 113.
There are eight data lines exiting each of the ten
ROMs and going to the data multiplexers 11-15 for reading
the contents of ROMs. ~his gives forty data lines coming
from each group of five RO~s. Twenty lines are re~uixed
for each 20 bit grammar byte. As described above, multi--
plexing switches between the low four and high four output
lines of each ROM so that each group of five ROMs forms a
20-bit byte at any given time. This effectively doubles
the addressing capability of the dictionary ROM.
All of the lines on FIGS. 2 through 5 designated
with a "+" are positive 5 volts.
The data multiplexers 11 through 15 are 74LS257 data
multiplexer chips. They switch between the low and high
~our output lines of dictionary ROM 1-10 and are switched
by the least significant bit of the absolute address
(TXADOO). That is, the multiplexers are switched by the
token address Oo of the token address bus 113. A zero on
line 00 o~ the dictionary address bus indicates that the
low four bits of each ROM on the dictionary address bus
are to be used for data and a 1 on line 00 of the token
address bus 113 indicates that the high four bits of each
ROM on the dictionary address ~us are to be used for data.

~25~B~.9
30~
1 Token address line 00 (TKAD00~ and token address line 14
(TKAD14) are both part of the token address bus 113.
The grammar bus 112 connects the data mu~ tiplexers
11-15 to the comparator 29-33 of FIG. 3~
The absolute address rec~ired to access words in the
dictionary ROM is 15 bits long, as can be seen from the
output of the token address counter 16-19. The 15-bit
address with multiplexing allows 32,768 absolute dictionary
addresses.
~0 The token address counter 16-19 ls comprised of four
74Lslsl up/down counters configured as an incremental
counter. The token address eounter 16-19 accepts the
starting address of the first word in the dictionary sector
to be searched and then increments from that address to
lS the end of the list in that sector at the mailbus clock
rate. The end of the list is determined by the starting
address and beta.
~he starting address for the dictionary segment and
~: sector derived from the memo~y map is parallel loaded
from control microprocessor 100 through tha external mail-
bus 115 to the internal mailbus 114 and then into the
address counter 16-19.
As shown in FIG. 5, the mailbus 102 of FIG. 1 is di-
vided, first, into the external mailbus ~Databus) 115,
: 25 the mailbus address bus 116, a mailbus input/output (MBIO),
mailbus interrupt (MBINT inverse), mailbus clock (MBCLK),
and mailbus read/write (MBRW) and, second, into the inter-
nal mailbus 114. The external mailbus 115 is the eight
data lines which enter the bidirectional data bus trans-
30 ceiver 4~ from the interface 101 of FIG. 1. The internal
mailbus 114 is the eight data lines between the bidirec-
tional data bus transceiver 46 and the tokenizer output
multiplexer 41-45, the register file 34-38, as well as
the token address counter 16-19 of FIGo 2~ and the dictio-
35 nary scan down counter 54-56 of FIG. 4.
.

i.8~
-31-
1Each of the three chips of the token address counter
16-18 supplies four of the bits required to form the 15-bit
dictionary address, with the last chip of the token address
counter 19 supplying the most: signi~icant three bits.
5As seen in FIG. 2, inverter 20 is part of a 7404
digital inverter chip. Inverter 20 produces the inverse
of the mailbus clock (M~CLK) i'rom the mailbus clock (MBCLK)
signal. The inverse of the mailbus clock is used by com-
pare latch 39 o~ FIG. 3, and it is also the beginning
of a time-delay chain comprising inverters 21 and 22.
Diyital inverters 20, 21, and 22 provide a timing
delay of the clock signal through sequential propagation.
Digital inverters 21 and 22 are part of a 74LS3~ NAND
; gate configured as inverters. This delayed clock signal
is used fDr the clocking o~ the token address counter
16-19. This delay guarantees that the address counter
will not increment prior to the ac~uisition of distionary
data from the grammar bus 112. The toXen address counter
16-19 increments on the rising edge of the delayed clock
Signal.
Noninverting buffer 23 is a 7407 noninverting buffer
chip. It provides a sequential propagation delay as part
of the timing delay circuit which further delays the ~lock
signal to the token address counter 16~19.
25Dictionary enable AND Gat~s 25 and 26 are 7408 AND
-gates. They provide the dictionary enable, which supplies
the enable signal to the dictionary ROMs 1-10 so that the
ROM address locations may be accessed. Two AND gates are
utilized to provide fanout since they must provide 2 signal
to each o~ the ten dictionary ROMs 1 through 10~ AND
gate 25 provides an enable signal to six of the dictionary
; ROMs 1-5 and 6-10, and AND gate 26 provides an enable
signal to four.
The dictionary ROMs 1-10 are tri-state output devices.
; 35When the outputs of 25 and 26 go low the selected dictio-
:

i8~
3~-
1 nary ROMs 1 through 5, or 6 through 10, appear on the data
bus. The dictionary ROMs 1 through 5, or 6 through 10,
are selected by the token address (TKAD 14) described
below), and its output appears on the grammar bus 112.
The dictionary enable AND gates 25 and 26 go low
when two signals are present at their inputs. Those sig-
nals ar~ mailbus ~lock inverse ~MBCLK) ~nd grammar bus
lock i~nverse (GBLOCK).
Grammar bus lock (&BLOCK~ (see tha output of NAND
gate 28 in FIG. 3) is used for reading the contents o~
khe dictionary ROM 1-10. During detokenization when a
direct look-up is made to the dictionary ROMs to convert
a token into a language word, the output of the grammar
bus 112 must be on the mailbus 102 so that it may be
accepted by the control microprocessor 100. GBLOCK allows
output of the contents o~ the addressed memory.
Inverter 27 is a part of a 7404 Hex invexter. Inver-
ker 27 selects the group of five dictionary RO~s 1-5, or
6-10~ which is to be enabled. The input of the inverter
27 is coupled to the bank o~ ROMs 1-5, while the output
i8 coupled to the bank of ROMs 6-10. Either dictio~ary
ROMs 1-5, or 6-10, will alw ys be selected. The input to
inverter 27 comes from the fifteenth ~it (mos signi~icant
bit) of the token address counter 19 (TKAD14). That is,
the fi~teenth bit from the token address counter 19 deter-
mines which group of ~ive dictionary ROMs 1~5, or 6-10,
- is to be enabled. Dictionary ROMs 1-5 contain the lower
16 kilobytes of dictionary memory, and dictionary ROMs
6-10 contain the upper 16 kilobytes of dictionary memory.
- 30 NAND Gate 28 of FIG. 3 is for producing GBLOCK lnverse
and is a part of a 74LS00 Quad NAND gate. It provides
the grammar bus lock inverse (GBLO~K) which is used by
the dictionary enable AND gates 25 and 26 to force the
enable of the dictionary ROMs 1-10 so that the dictionary
~ 3~ ROMs 1-10 may be int~rrogated during the detokenization
:

8~-~
-33-
operation,
FIGS. 3 and 4 can be placed si~e by side, with FIG.
4 on the right, to form one continuous schematic.
As ~hown in ~IG. 3, c~mparators 29-33 compare the
contents of the grammar bu~; 112 (from the multiplexer6
11-15) with a trial 20-bit grammar byte which is part ~f
a trial word loaded into the register file 34-38. From
one to four 20-bit bytes (grammar bytes~ are compared for
each language word, depending upon the length of the word.
Each grammar byte represents up to four characters of a
language word.
R~gister file 34-38 is loaded with the trial word
from the microprocessor through the bidirectional data
bus transceiver 46 and the external mailbu~ 115, as de-
scribed above. Each chip of the register file is a groupof 4-bit register slices, i.e., ~our 4-bit registers.
At any one time, the address buf~er 5B and the register
file select 59, described below, may address any 4-bit
register in each chip in order to load data in the form
f bits representing characters of the trial word. How-
:~ ever, because internal mailbus 114 passes only eight bits
of data at a time, only two 4-bit registers may be loaded
at a time. The register file 34-38 is coupled to the
internal mailbus 114 such that the lowest four bits of
the data bus and the highest four bits are coupled to
; pins 15, 1, 2, 3 of chips 34 and 35, respectively, and to
the same pins of chips 36 and 37, respectively. The lowest
four bits are also coupled to pins 15, 1, 2, 3 of chip
; 38. Because the control microprocessor and the internal
mailbus provide only eight biks of data per cycle, the
first four bits of the first eight bits of a trial word
are placed in the first 4-bit register of chip 34, and
the last four bits are placed in the same 4-bit register
of chip 35. The same 4 bit register of chip 36 gets the
~irst four bits of the second eight bits of data, while
1 .

5~
-34-
1 the same 4-bit register of chip 37 gets the second ~our
~its. Finally, the same 4-bit register of chip 38 gets
the next four bits of the trial word ~rom the conkrol
microprocessor over Internal mailbus 114. In summary,
the first 20-bit grammar byt:e o~ the trial word has been
loaded ~rom the control microprocessor into the ~irst 4-
bit registers o~ register ~ile 34-38. The remaining 20-
bit grammar bytes are loaded in a similar manner. There-
fore, the trial word will be stored in the ~orm o~ 20-bit
grammar bytes, just as ~he language woxds are stored as
20-bit grammar bytes in the dictionary ROM. The remaining
20-bit grammar bytes, up to a total of four grammar bytes
or 16 characters totaling 80 bits, are stored in respective
4-bit registers of the register file 34-38. As a result,
all 20 bits representing four characters in a given grammar
byte may be accessed and processed in parallel in a manner
similar to that ~or a given bank of dictionary ~OMs. A
20-bit or 4 character comparison can be made in one clock
cycle. As words are read, grammar byte-by-grammar byte
from the appropriate bank o~ the dictionary ROM 1-10,
they are cc~mpared to respective grammar bytes of the trial
word until either a match is found between words, or all
; of the words in the chosen sector of the dictionary ROM
1-10 are exhausted. The five 74170 chips combined have a
maximum capacity of 80 bits, or four grammar bytes, com-
prising a total of 20 characters.
Inter grammar byte compare latch 39 is a 7474 Dual D
Flip-Flop~ A successful cvmparison (e.g., bit-for-bit
equivalence~ on the appropriate number of grammar bytes
as defined by the segment code will cause DEQ (Data Ec~uals)
on the inter grammar byte compare latch 39 to go high.
When ~EQ goes high produces scan termination logic 40,
the stop scan signal, and a start scan flip-flop in scan
control logic 51 is reset at pin 1. This causes scan
enable inverse ~SCANEN) from scan control logic 51 to go

~2~;5~
1 high. The inter grammar byte compare latch 39 uses the
mailbus clock inverse signal (MBCLK) from inverter 20 for
timing.
Pins 1, 10, and 14 of inter grammar byke compare
latch 39 go to positive 5 volts. Pin 7 is ground. Pin 4
i5 the output of NAND gate 65. Pin 11 is an input to
NAND gate fi5. Pin 8 is an input to N~ND gate 70. Pin 13
goes to TKRST and to pin 10 on scan termination logic 40.
Pin 2 goes to pin 6 o~ comparator 33. Pins 5 and 12 go
lo to pin 2 of comparator 29. Mailbus clock inverse (MBCLK)
is pin 3, and Data Equals (DEQ) is pin 9.
Scan termination logic 40 allows the loading of suc-
cessive 20-bit bytes, even though an earlier byte failed
comparison. The ap~ropriate number of bytes must be com-
pared ~or the selected dictionary segment, since eachdictionary segment contains words of a given length.
This pe~nits a new comparison of a subsequent word in the
same sequence to begin at the start o~ the next language
word. For example, a search for the word "water" may
first produce the dictionary word "waste" for compari60n.
Since a match would not be found ~or the first grammar
byte, the third characters in each word not being the
~; same, a failure would be indicated in the compare latch
39. However, the second grammar byte for the word "waste"
containin~ the letter '!w" would nonetheless be loaded
into the comparators 29-33, and the segment counter would
be decremented, the address coun~er would be incremented,
and the scan limit down counter would be decremented.
This is done, even though a match is not ~ound, so that
the address of the next word for comparison is present in
the address counter. Therefore, the continued cycling
through the remaining gra~nar bytes of a word which failed
comparison merely serves to update the appropriate counters
and registers. Any loss in computing time is relatively
minimal. The scan termination logic 40 also serves to

~L2~8~
-35-
1 reset scan control iogic 51 from pin 8 to stop all counters
when a match is found. Logic 40 may also provide an inter-
rupt to processor 100 through AND gate 66.
Pins 4 and 10 of scan termination logic 40 are the
tokenizer reset inverse (TKR';T) signal. Pin 5 is an input
to NAND gatc 70. Pin 3 goes to pin 13 of dictionary scan
down counter 56. Pin 11 is an output from inverter 62.
Pin 8 is an input to AND gate 66 and to pin 1 of scan
control logic. Pins 2 and 7 go to ground. Pins 1, 13,
and 14 go to positive 5 volts. Pin 6 is the Data Not
Found (DNF) latch. Pin 12 is the output of NAND gate 70.
Tokenizer output multiplexer 41-45 multiplexes tokan
address bus 113, the grammar bus 112, and the tokenizer
status data to the microprocessor. The given multiplexer
used ~or outpuk is determined by address decoder 52, de-
scribed more fully below.
The tokeni2er status is presented to the control
microprocessor 100 from three free pins of the tokenizer
output multiplexer 45. The status will be either Dat.a
Equals (DEQ), Data Not Found (DNF), or Scan In Progress
(SIP), which are found on pins 11, 13, and 15, respective-
:`' ly.
If the control microprocessor 100 interrogates thesymbolic tokenizer and finds the status to be Data E~uals
(DEQ), then the control microprocessor 100 will read the
address of the appropriate bank of dictionaxy ROM 1 through
10 where the word was found. This is done by pro~iding
the appropriate read command address to the mailbus address
line 116. The address is decoded by the address decoder
52, describled below, which provides an enable signal for
tokenizer output multiplexers 41 and 42 for reading the
appropriate data lines from the token address bus 1130
The address is output onto internal mailbus 114 and read
from the bidirectional data bus transceiver 4~, described
below. The address can be read from the output of, or

~25~ 9
-37-
1 settings for, the token address counter 16 19. This can
be done because the token address counter was stopped
when a match was found throllgh compare latch 33 and scan
termination logic 40. That address subtracted by the
5 number of the segment code (the number of address incre-
ments) divided by the dictionary segment number and added
to an offset by the aontrol microprocessor, then becomes
the token for the word.
Bidirectional data bus transceiver 46 is an 8-bit
tri-state data transceiver to inter~acP the symbolic token-
izer with the external mailbus 115. It is used to direct
data ~low between the symbolic tokenizer and the control
microprocessor 100.
Segment Up Counter 47 is a 2-bit binary up counter,
and it is reset prior to reading a new language word from
the dictionary ROM 1-10 to reflect the proper number of
grammar bytes reguired to compare the entire trial word
in the register ~ile 34-38. It progressively selects
grammar bytes from the register ~ile 34-38 for comparison
until the re~uired number oP grammar bytes have been com-
pared with a language word.
Segment holding register 48 stores the selected 2-
bit segment code, designated above as Sl, S2, S3, or S4,
generated by the control microprocessor. The segment
code is taken ~rom the internal mailbus 114 on lines 120
to pins 2 and 12 o~ register 48. The sesment code deter-
mines the number of reads of the given bank of dictionary
ROM 1-10 required to compare a given language word to each
dictionary word during a search. Each read of the dictio-
nary ROM 1-10 loads a maximum o~ four characters into the
comparators 29-33, so that to compare a 16-character word
requires four successive loads.
Pins 1, 4, 10, 13, and 14 o~ segment holding register
48 go to positive 5 volts. Pin 2 goes to pin 1~ of the
bidirectional data bus transceiver 46. Pin 12 goes to

~:25~.8~9
-38-
1 pin 17 of the bidirectional data bus transceiver. Pin 7
is ground. Pins 3 and 11 go to pin 10 of address decoder
50. Pin 9 goes to pin 1 of the segment down counter 49,
and pin 5 goes to pin 15 of the Segment Down Counter 49.
Segment down counter 49 determines when the proper
number of grammar bytes have been accessed by the segment
up counter 47 for comparing with a given language word.
Pin 11 goes to the output of AND gata 67. Pin 13 goes to
pin 12 of scan control logic 51 and pin 14 goes to SCANCLK.
Pin 4 goes to pin 6 of scan control logic 51. This counter
must be reloaded from register 48 each time the proper
number of grammar bytes have been read ~rom the dictionary.
Address decoder 50 is a 74LS138 decoder ~or decoding
the address and write commands from the control micropro-
cessor and providing a strobe to the correct logic compo~nent. It supplies the inverse tokenizer reset signal
(TKRST) which is initiated by the microprocessor to reset
all of the tokenizer logic. Address decoder 50 also sup-
plies the scan address load strobe low byte inver~e
(SCANLDL) to token address counter 16 and 17 through in-
ternal data bus 114. It cau~es the least significant
eigAt bits of the toke~ address counter 1~ through 19 to
be loaded from the control microprocessor 100.
Address decoder S0 also supplies the scan address
load high byte inverse (SC~NLDH3, which loads the most
significant eight bits of the token address counter 16
through 19. only 15 bits of the ~ombined least significant
eight bits and most significant eight bits are actually
used. Decoder 50 also supplies the scan counter load low
strobe ~SCANCTL) to load the least significant eight bits
of the dictionary scan down counter 54-56. Address decoder
50 also loads the most significant four bits of the scan
counter using the scan counter load high strobe (SCANCTH).
The dictionary scan down counter is loaded from internal
mailbus 114 with a representation of beta from the appro-

12S~8~
-39-
1 priate entry of thé memory map.
Pins 5 and 8 of the address decoder 50 are ground.
~ins 6 and 16 are positive 5 volts. Pin 1 goe~ to pin 3
of address decoder 52 and to pin 14 of address bu~er 58.
Pin ~ goes to pin 12 of address buf~er 58. Pin 3 goe~ to
pin 9 oP address buf~er 58. Pin 9 goes to pin 4 of scan
control logic 51. Pin 15 goes to pin 13 o~ scan control
logic 51 and provides the tokellizer reset code. Pin 10
goes to pins 3 and 11 of segment holding r~gister 48.
Pin 4 goes to pin 11 of register file select 59. Pin 11
is the scan counter high byte load strobe (SCANCTH), and
pin 12 is the scan counter low byte load strobe (SCANCTL).
These strobes are used to enable loading of the dictionary
scan down counter 54, 55, and 56 when the control micro-
processor 100 issues the LOAD SCAN ADDRESS WORD COUNTER
LOW BYTE and LOAD SCAN ADDRESS WORD COUNTER HIGH BYTE
write commands. Pin 13 is the scan address load strobe
high SSC~NLD~), and pin 14 is the scan address load strobe
low (SCANLDL3. These strobes are used to enable loading
of the token address counker 16-19 when the control micro-
processor 100 issues t~e LOAD SCAN ADDRESS REGISTER LOW
BYTE and LOAD SCAN ADDRESS REGISTER HIGH BYTE write com-
mands.
The scan control logic 51 is a 7474 Dual D Flip-Flop.
Scan control logic 51 provides the scan enable latch,
controls the operation of the segment up counter 47, and
pro~ides the Scan In Progress code~ Scan control logic
also stops all tokenizer operation when logic 51 i5 reset.
When the scan enable inverse of the scan control logic 51
goes high after the reset signal ~rom scan termination
logic 40, all address counting in the token address counter
; 16 through 19 terminates (see SCANEN input o device 16
of FIG. 2). At the same time, a mailbus interrupt signal
(MBINTO) is produced to get the attention of the control
microproce~sor 100. Also, the tokenizer status can at

-40-
1 this time be read by the microprocessor, and the address
o~ the language word in the dictionary RO~s can be xead
aftex issuing an appropriate command.
Pin 14 o~ scan control logic 51 is positive 5 volts.
Pin 13 goes to pin 15 of address de~oder 50. Pin 4 comes
from pin 9 of address decoder 50. Pin 8 is an input to
NAND gate 65 at pin 13 and to pins 2 and 3 of seyment up
counter 47. Pin 1 goes to pin 8 of scan termination logic
40. Pins 3 and 7 are ground. Pin 10 is the output of
NAND gake 65 through inverters 63 and 64. Pin 6 is the
scan enable inverse (SCANEN) signal. Pin 9 i~ an input
to pin 9 of AND gate ~7. Pin 5 is an input to pin 10 of
AND gate 67 and provides the Scan In Progress tSIP~ signal.
Pin 12 goes to pin 13 o~ the segment down counter 49.
Pin 11 goes to pin 14 of the segment down counter 49 and
is coupled to the SCANCLR signal.
Address decoder 52 i~ a 74LS138 Decoder for dPcoding
the address and write commands of the control microproce~-
sor and for providing a strobe to the correct logic compo-
nent. It selects the tokenizer output multipIexer 41-45
channels for ¢ontrol microprocessor 100 readout.
Pin 16 of address decoder 52 goes to positive 5 volts.
Pin 6 goes to pin 6 of register file select 59 and pin 6
of address select 53 and to inverter 60. Pin 1 goes to
2~ pin 1 of register file select 59 and to pin 18 of address
buffer 58. Pin 2 goes to pin 2 o~ register file select
59 and to pin 16 of buffer 58. Pin 3 goes to pin 1 of
address decoder 50, to pin 14 of address buffer 58, and
to pins 14 of each register file chip 34-380 Pin 15 goes
to pins 1 and 19 o~ tokenizer output multiplexer 41.
Pin 14 goes to 1 and 19 of tokeni~er output multiplexer
42. Pin 1~ goes to pins 1 and 19 of tokenizer output
multiplexer 43. Pin 12 goes to pins 1 and 19 of tokenizer
output mult:iplexer 44. Pin 11 goes to pin 1 of tokenizer
output multiplexer 45, and pin 10 goes to pin 19 of token~

~25~8(!~
-41-
1 izer output multipléx~r 45. Pin 4 goes to pin 3 o~ regis-
ter file select 59 and to inverter 69. Pins 5 and 8 are
ground~
Address select 53 is a 7485 comparator for sensing the
condition o~ the dip switch 57 and for taking the most
signi~icant four bits o~ the nine bits of the mailbus
address write or read commands and enabling the correct
address decoder. ach symbolic tokenizer may have an
address ~rom 1 to 16. The output of address select 53,
is used by the s~mbolic tokenizer tb ascertain if that
symbolia tokenizer has been addressed by the microprocessor
and i~ so, to enable transceiver 46, address decoder 52
and register file select 59.
The mailbus address lines 116, which allow the control
microproc~ssor loO to address the symbolic tokenizer
through the address select 53, include lines 05, 06, 07,
and 08 of the complete mailbus address lines llSu
Dictionary scan down-counter 54, 55, and 56 are 74191
counters and are loaded with a repxe~entation o~ beta9
~20 Scan down counter decrements with each mailbus clock cycle
-~as words are raad ~rom the given banks of ROMs in the
dictionary ROM 1-10 for comparison to the trial word in
the register ~ile 34-38. As discussed above, the dictio-
nary scan down counter is loaded with a representation o~
beta from the appropriate entry of the memory map, e.g.,
the number o~ grammar bytes needed to read the sector
beginning with the starting address read from the appro-
priate entry in the memory map into the token address
counter 16-19. When the dictionary scan down-counter 54,
55, and 56 reaches zero, Data Not Found (DNF) is produced,
causing a mailbus interrupt (~BINTO) to the control micro
processor 100. The mailbus interrupt causes the control
microprocessor 100 to interrogate the status o~ the sym-
;bolic tokenizer. When the status of the symbolic tokenizer
3~ is interrogated, the DNF (Data Not Found) is sent to the

~25~8~
1 control microprocessor 100. That isl the word to be token-
ized was not found in the appropriate bank of the dictio~
: nary ROM 1 through 10.
Dip switch 57 is used in conjunction with address
: 5 select 53 to select the control microprocessor 100 address
of the 6ymbolic tokenizer. More than one symbolia token-
izer may reside on the mailbus.
Address buffer 58 generates the binary address to
supply a 2-bit code to pins 13 and 14 of the register
file 4-38. ~his selects the address of the proper 20-
bit byte of the four possible 20-bit bytes for loading
into the register file 34-38. Address buffer 58 also
selects whether address decoder 50 or register file select
59 is enabled and ~rovides a code for the enabled component
for producing an appropriate output. Address buffer is a
74244 chip.
Register file select 59 uses the input from address
buffer 58 to either select the appropriate register file
34-38 for loading the trial word from the control micropro-
20 cessor 100 or to enahle address decoder 50. The primary
function is to select the appropriate registers in the
register file 34-38.
Pin 16 of register file select 59 goes to positive 5
volts. Pin 10 goes to pin 12 of register files 34 and
35. Pin 9 goes to pin 12 of register files 36 and 37.
Pin 7 goes to pin 12 of register file 38. Pins 5 and 8
are ground. Pin 3 is the output of inverter 69 and is
coupled to pin 4 of address decoder 52. Pin 4 is the
: mailbus clock inverse signal (MBCLK). Pin 1 goes to pin
30 1 of address decoder 52 and to pin 18 of address bufer
58. Pin 2 ~oes to pin 2 of address decoder 52 and to pin
16 of address buffer 58. Pin 6 goes to pin 6 of address
decoder 52 and pin 6 of address select 53. Pin 11 goes
to pin 4 of address decoder 50.
Inverter 60 is part of a 74LS04 digital inverter

g~5~8~
-43-
1 chip. It inverts the enable signal from address ~elect
53 to pin 19 o~ the tri-statle enable of the bidirectional
data bus transceiver 46 to provide the correct polarity
or direction.
Inverter 62 is part o:E a 74LS04 digital inverter
chip. It inverts the SCANCLK signal to SCANCLK inverse
for NAND gate 65~
Inverter 63 and inverter 64 are part o~ a 74LS04
digital inverter chip. They form a seguential propagation
lo delay to increase the pulse width of the preset pulse
from the NAND gate 65 to the scan control logic 51 and
the inter grammar byte compare latch 39.
NAND gate 65 is part of a 74LS38 inverting AND gate.
It generates the preset pulse to the inverters S3 and 64.
:: 15 The preset pulse presets the compare logic posting
flip-flop of the inter grammar byte compare latch 39 which
: causes pins 5 and 12 in the i.nter grammar byte compare
latch 39 to go high. This causes pin 3 of the comparator
29 to go high, which posts a 1 into the comparators 29-33
which indicates that the last compare was true. A failed
comparison thereafter results in a 0 on pin 3 o~ the com~
parator 29 and on the rçmaining comparators until the
preset pulse again presets the compare latch 39.
AND gate 66 is part o~ a 7408 AND gate chip~ When
multiple symbolic tokenizers sit across the mailbus 102,
AND gate 66 causes an interrupt to the control micropro-
cessor 100 when a scan termination by the compare logic
103 is found true, i.e., when the scan is terminated by
scan termination logic 40. This is caused by either a
true signal on the line from termination logic 40 or a
true signal ~or mailbus interrupt input inverse (MBINTI)
from one of the other tokenizers on the mailbus 102. The
appropriate output of AND gate 66 alerts the control micro-
processor 100 that a symbolic tokenizer has completed a
3~ scan cycle. The scan through the other tokenizers are
.~

~255~
~4~-
1 terminated. Interrogation of the toXenizer status will
yield the cause of the termination. MBINTI low indicates
that anokher tokenizer on the bus has already caused an
interrupk, and every other symbolic tokenizer must wait
- 5 for interrogation by the control microprocessor 100 i~
more than one interrupt request was generated.
AND gate 67 i8 part of a 7408 ~ND gate chip. It
generates the load strobe ~or the segmenk counter 49 to
cause it to be reloaded ~rom the segment holding register
lo 48.
Inverter 69 provides an appropriate signal to the
transceiver 4S for input or output and to address decoder
52 to provide a correct output.
FIG. 5 shows the interface 101 of FIG. 1 which inter
faces the symbolic tokenizer to the control microprocessor
100 of FIG. 1. The prototype symbolic tokenizer utilizes
a Commodore 64K computer for its control microprocessor.
Any central processing unit can serve as a control micro-
processor, i~ appropriately interfaced to the symbolic
tokenizer.
A connector 201 attaches the interface 101 to the
control microprocessor 100. This connection i~ made
through the I/0 expansion jack of the Commodore 64K compu~
ter. The pin numbers of the Commodore 64K I/0 expansion
jack are shown on the connector 201.
NAND yate 202 is part of a 7437 inverting AND gate
chip. It provides an enable signal to the bidirectional
bus data transceiver 204.
NAND gate 203 is part of a 7437 inverting AND gate
chip. It provides the mailbus I/0 signal (MBI0) to the
symbolic tokenizer.
The mailbus I/0 signal (MBI0) is high when the mailbus
102 is being addressed by the control microprocessor 100.
Bidirectional data bus transceiver 204 is a 74LS245.
It provides data transfer between the control microproces-
.

~l2558(~9
-45-
1 sor 100 and the symbolic tokenizer.
Buffers 205 and 206 are both 7407 buf~er chlps.
They provide address, mailbus clock, and mailbus read/write
signal is~lation between thle control microprocessor 100
and the symbolic tokenizer. The six l~nes exiting bu~fer
205 are mailbus address lines 116. The uppermost three
lines exiting bu~er 206 are also mailbus address lines.
The lower two lines exiting hu~fer 206 are the mailbus
clock signal (MBCLK) and the mailbus read/write signal
lo (MBR~).
The mailbus interrupt inverse signal (MBINT) supplies
an interrupt signal to the control microprocessor 100
: when a dictionary scan is complete. This may be omitted
if the control microprocessor also does other tasks.
15The external mailbus 115 is the data path between
the control microprocessor 109 and the s~mbolic tokenizer.
The mailbus address lines 116, along with the mailbus
clock and mailbus read/write siqnals, provide the control
microprocessor 100 with the capability to address the
dictionary ROMs 1-10 of the symbolic tokenizer. When sev-
eral sy~bolic tokenizers are connected to the external
~ mailbus 115, ths mailbus address lines 116 also provide
: the addressing signals by which each ~ymbolic tokenizer
recognizes that it is the s~mbolic tokenizer being ad-
dressed. This happens when the address select 53 (of
FIG. 4) decodes the address coming across the mailbus
address lines I16 as that address set into thP dip switches
57.
The method of using the apparatus described herein
will be described with respect to FIGS. 8 and 9. It will
be assumed that a dictionary ROM 104 has been created
according to the arrangement described with respect to
FIG. 6, and that a memory map has been created based upon
the ROM addresses. The memory map is contained in ROM
100A. A software routine has been created which is usable

~L25~8~
-~6-
1 with the control microprocessor to derive tokens from thedictionary ROM address and to derive a dictionary ROM
address and the corresponding language word given a
particular tokenO Additiona:lly, the read and write com-
mands for accessing the tokenizer have also been ~toredin the control microprocessor. These have been defined
as described above with resplect to the flowchart o~ FIG.
7.
The following method description will be o~ the token-
izer 102A. The tokenizer condition is static and the
tokenizer is available for accepting various read or write
commands ~rom the control microprocessor through interface
101. It will be assumed henceforth that all commands to
be accepted by the tokenizer 102A are made by the control
microprocessor through interfa~e 101 to mailbus 102, i.e.,
busses 115 and 115. The control microprocessor will pro-
vide an address code over the mailbus address line 116,
the address bus, and will provide data over the external
mailbus 115~ the data bus. Additionally, the mailbus
input/output, the inverse mailbus interrupt, ground, mail-
bus clock, and the mailbus read/write codes are provided
over pins on the inter~ace 101, as indicated in FIG. 5.
: Given a command from the control microprocessor, the
tokenizer accepts a read or write ~ommand on mailbus ad-
dress line 116, as indicated in block 301 of FIG. ~A.
The tokenizer accepts a 4-bit code from the four most
significant bits of the read or write command and decodes
the 4-bit code in address select 53 by comparing the code
with the settings of the DIP switch 57. If the 4-bit
corresponds to 5SSS, transceiver 46t address decoder 52
and re~ister file select 59 are enabled. If the address
- code does nbt correspond to SSSS, the tokenizer returns
to 300 at the beginning of block 301 to awai~ another
address code~
In block 304, and in subsequent blocks, it i5 assumed

~2~S8~
-47-
1 that the most signlficant four bits of the address code
correspond to the DIP switch sek for the tokenizer. I~
~here was no correspondence, the write or read command on
the mailbus address line 116 would correspond to a di~er-
ent device coupled to the control microprocessor throughmailbus 102. Hereafter, it will be assumed that the proper
address SSSS was provided with the 8-bit address code,
and only the least ignificant five bits on the mailbus
- address line 116 going into address buffer 58 will be
; 10 considered.
In blocks 304-340, blocks 372-382, and blocks 402-
414, write and read commands are being issued by the con-
; trol microprocessor for initializing and starting the
tokenizer or ~or readinq data. In blocks 342-368, the
tokenizer is processing the commands loaded from the con
trol microprocessor to provide either a memory address
of a word to be tokenized, or a language word based on an
addre~s derived from a token in the control microprocessor
(detokenization). ~umber 370 is a flow indicator.
In block 304, a tokenizer reset command is issued by
the control microprocessor to reset the scan control logic
51 and the scan termination logic 40. The segment down
counter 49 and segment up counter 47, are reset, and a 1
is posted on compare latch 39. Specifically, if the ad-
25 dress is 00000, the tokenizer i5 reset by enabling address
decoder 50 to output TKRST from pin 15. The scan control
logic 51 is reset through pin 13~ and the scan termination
logic 40 is reset through pins 4 and 10. Compare latch
39 is reset through pin 13 so that a 1 is posted at pins
30 5 and 12. The segment up counter 47 and segment down
counter 49 are also reset. The tokenizer status then
returns to 300 to await an additional command. Otherwise,
the tokenizer operates as described below for a different
input command.
In blocks 306-328, the register file 34-38 is loaded

~2~
-48~
1 with the trial word. Specifically, the trial word is
separated by the control microprocessor 100 into 20-bit
grammar bytes, each grammar byte keing loaded into regis-
ters in the register file 34-38 in ~ se~uence o~ eight
bits, eight bits, and four bits. Subsequent grammar bytes
are loaded in the same manner. Specifically, if the ad-
dress is 00001, the first four bits of register files 34
and 35 are enabled through pin 10 o~ register file select
59 and pins 12 and 14 of buffer 58 so that the first four
b.its of each of register ~iles 34 and 35 will accept the
respective four bits from the internal mailbus 114 through
transceiver 460 As mentioned above, the ~irst grammar
byte is loaded first by loading the first eight bits of
the grammar byte through the external mailbus 115. The
next eight bits are loaded according to block 308, and
the last four bits of the first grammar byte are loaded
according to block 310. In any case, the tokenizer is
ready for another command by returning to 300.
: Specifically with respect to block 308, i~ the address
is 00010, the first four bits of register files 36 and 37
are enabl2d through pin 9 of register file select 59 and
pins 12 and 14 of buffer 58. The first bit of each regis-
ter 3Ç and 37 will accept respective four bits from the
eight bits of data coming from the internal mailbus 114.
In this manner, eight bits of data can be loaded in paral-
: lel to the respective registers in a single clock cycle.
After the first eight bits are loaded, the tokenizer is
then ready for loading of additional bits of the first
grammar byte. In block 308, this is r~presented by a
return to 300.
~ In block 310~ if the address code from control micro-
processox lt)Q is 00011, the first four bits of register
file 38 are enabled through pin 7 of register file select
53 and pins 12 and 14 of bu~fer 58. The first four bits
will then accept the last four bits o~ the first grammar

~25~
~49-
1 byte from internal mailbus 114 to complete the loading o~
the first grammar byte into register file 34-38. Null
characters are loaded into the appropriate bits 1n the
register ~ile in the situation where the trial word does
not take up an entire grammar byte. The tokenizer is
then ready to accept an additional address code in the
form o~ a read or write command. This is represented in
block 310 by a return to 300.
In block 312, if thQ address is 00101, the second
four bits of the register files 34 and 35 are enabled
through pin 10 o~ register file elect S9 and pins 12 and
14 of buffer 58. The second four bits of each register
will th~n be able to accept respective four bits ~rom the
first eight bits of the next grammar byte ~rom internal
mailbus 114. This will occur only i the trial word is
more than four characters in length, i . 8 ., requires more
than ons grammar byte. The tokenizer i~ then ready to
accept another address.
In block 314, if the address is 00110, the second
four bits of register files 36 and 37 axe ena~led through
pin 9 of register file s~lect 59 and pins 12 and 14 o~
buffer 58. The respective second ~our bits can then accept
respective four bits of the next eight bits in the second
~; grammar byte from internal mailbus 114. The tokenizer i5
then ready to accept a further read or write co~mand.
In block 316, if the address is 00111, the second
four bits of register ~ile 38 are enabl~d through pin 7
of register file select 59 and pins 12 and 14 of buffer
58. As a .result, the second four bits of register file
38 will accept the remaining four bits from the second
grammar byte off of internal mailbus 114 In summary,
three clock cycles are used to load the second grammar
byte rom internal mailbus 114 into the registers 34-38
of the register file. If the trial word did not require
a complete second grammar byte, null characters are in-

~Z5~
--so--
1 serted in the bits not re~lired. If the word requiresonly two 20-bit grammar bytes, the remaining bits in the
ragister ~ile are not used.
I~ additional grammar bytes are required to store
the trial word in register f.ile 34-3R, the control micro-
processor will issue a write command having an address of
01001. In block 318, this address causes the third ~our
bits of register files 34 and 35 to be enabled through
pin 10 o~ register file select 59 and pins 12 and 14 of
buffPr 58. The third four bits of each register will
.- then accept respective four bits from the first eight
bits of the third grammar byte off of the internal mailbus
114.
In block 320, the next eight bits of the third grammar
byte are loaded into the register file. Therefore, if
the address is 01010, the third four bits of register
files 36 and 37 are enabled through pin 9 of register
fil~ select 59 and pins 12 and 14 of buf~er 58. The third
four bits of the two register files will accept respective
zo four bits from the second eight bits of the third grammar
byte o~f of internal mailbus 114.
In block 322, if the address is 01011, the third
four bits of register file 38 are enabled through pin 7
of register file select 59 and pins 12 and 14 of buffer
58. The third four bits will accept the last four bits
of the third grammar byte from internal mailbus 114. In
this manner, three grammar bytes, or up to twelve charac-
ters, are loaded into the register file i~ nine clock
cycles. If an additional ~rammar byte is required to
store the entire trial word, an additional write command
is issued by the control microprocessor, as discussed
below.
In block 324, the first eight bits of the fourth
grammar byte are loaded into the register file. If the
address is 01101, the ~ourth four bits of register files

~L2~
-51-
134 and 35 are enabled through pin 10 of register file
select 59 and pins 12 and ~.4 of buffer 58. The fourth
~our bits will then accept respective four bits of the
first eight bits of the fourth grammar byt~ off of internal
mailbus 114.
In block 326, if the address is 01110, the fourth
four bits of register files 36 and 37 are enabled through
pin 9 o~ register file select 59 and pins 12 and 14 of
buffer 58 The ~ourth four bits will then accept respec-
: lo tive four bits from the second group o~ eight bits of the
fourth grammar byte off of internal mailbus 114.
In block 328, if the address is 01111, the fourth
~our bits of register file 38 are enabled through pin 7
of register file select 59 and pins 12 and 14 of buffer
58. The fourth four bits will then ac~-ept the last four
bits of the fourth grammar by~e from internal mailbus
114. After this step, the register ~ile will contain all
four grammar bytes required to store the trial word. The
:tokenizer will then be ready to accept a further writ~
or read command, as appropri~teO In summary, if all regis-
ter bits are filled, a 16-character trial word has been
stored into the register ~ile using twelve clock cycles
to store eighty bits of in~ormation. ~his is done by
reading representations of characters into the register
file in parallel and ~uccessi~e reads of eight bits, eight
bits, and four bits. Wh~n the register file contains
the trial word, the tokenizer is then ready to accept
~urther commands.
It should be noted that four address bits are signif-
30 icant for appropriate loading of the register ~ile 34-38.
The codes used to access the register file begin with the
first address described in block 306, namely, 00001, and
end with the address discussed in block 32~, nam~ly, 01111.
Though only pins 12 and 14 of address buffer 58 and 7, 9,
3S and 10 of re.gister file select 59 were discussed in blocks

~Z~;S8~
-52
1 306-328, it is to ~e understood that all four bits of the
address were used to access and load the register file
34-38. For example, pins 16 and 18 o~ bu~fer 58 were
used to produce the appropriate output signal from pins
7, 9, or 10 of registsr ~ile select 59.
In block 330, the control microprocessor issues a
write command for loading into the token address counter
16-19 part of the starting address of the sector to be
searched. Specifically, if the address is 00100, the
lo address decoder 50 is enabled through low signals on pins
16 and 18 of buffer 58 and a high on pin 11 of register
: file select 59. This causes a 3-bit code to be sent from
pins 9, 1~, and 14 of address buffer 58 to the decoder 50
to strobe token address counter-register 16 and 17 from
pin 14 of the decoder. Upon the production of the strobe,
SCANLDL is sent to counters 16 and 17 so that the counters
will accept the low eight bits of the starting address
from the internal mailbus 114 through transceiver 46.
The tokenizer is then ready to accept the high eight bits
f the starting address to be loaded into counters 18 and
19 .
In block 332, the remaining eight bits of the starting
address are loaded into the ounters 18 and 19 over the
internal mailbus 114. Specifically, if the command address
25 is 01000, the address decoder 50 is enabled as in block
330 to strobe token address counter-register 18 and 19.
A strobe SCANLDH is provided to counters 18 and 19 from
pin 13 of decoder 50 to accept the high bits of the start-
ing address from ~.he internal mailbus 114. The token-
izer is then ready to accept a further read or write com-
mand.
In block 334, a portion of khe value of beta, the
total number of grammar bytes to be read to search through
the entire sector under consideration, is loaded into the
dictionary scan down counter 54-56. Specifically, if the

~2~S191C~
~53-
1 command address is 01100, the address decoder 50 i5 enabled
as in block 330 to strobe the dictionary scan down counter
54 and 55 to ~ccept the low eight bits o~ beta ~rom the
internal mailbus 114. The upper four bits can then be
loaded into counter 56.
In block 336, the remaining ~our bits of the beta
value are loaded into the counter 56. Therefore, if the
command address is lG000, the address decoder 50 is enabled
as in ~lock 330 to strobe the dictionary scan down counter
56 using the SCANCTH from pin 10 o~ decoder 50. The dic-
tionary scan down counter 56 then accepts the high four
bits o~ beta from the internal mailbus 114.
In block 338, ths segment code is loaded into segment
holding register 48 according to a 2-bit code read from
the internal mailbus 114 through bidirectional transceiver
46. Specifically, if the control microprocessor outputs
an address of 10100, address decoder 50 is enabled through
low signals on pins 16 and 18 o~ the bu~fer 58 and a high
signal on pin 11 o~ register file select 59. A 3-bit
code is sent from pins 9, 12, and 14 of buffer 58 to the
decoder 50 to provide a load strobe from pin 15 of decoder
50 to cause the segment holdin~ register 48 to accept the
2-bit segment code. The segment code is taken over lin~s
120 from the internal mailbus 114 through transceiver 46.
The tokenizer is then available during th~ next clocX
cycle to accept another command. Otherwise, a difPerent
command and address code was provided by the control micro-
processor and is processed as provided hPrein.
It should be noted that the above-described sequence
3~ of steps is not significant for operation of the tok~nizer.
As long as the various components are reset and loaded
with the appropriate in~ormation, the particular order in
which the i:nformation is loaded into the components is
not significa~t. However, the appropriate values should
3~ be entered :in the registers prior to issuing the start
.
.- .. ~ , .

-54-
1 scan command, since the tokenizer functions as an indepen-
dent processor once the re~ired parameters are loaded.
In this way, the control microprocessor can do separate
tasks while the t~kenizer is operating. However, once the
appropriate registers are loaded, the start scan command
may be issued.
In block 340, the toXenizer is issued a command to
start scanning ~or the word to be tokeniæed. Since the
trial word is already storsd in the register ~ile 34-38,
lo and the start address is loaded in token address counter
16-19, the counters have been loaded with the appropriate
values, the toXenizer can step throu~h the language words
stored in the appropriate bank o~ dictionary ROMs 1-10 to
compare the words with the trial woxd. This is done in a
grammar byte-by-grammar byte sequence where the grammar
byte of the tr~al word is compared with a respective gram-
mar byte of ~ language word found in the dictionary ROM,
all during a single clock cycle. If a match is ~ound
a~ter all grammar bytes o~ a given language word have
been compared, a data equals (DEQ~ i~ issued, and the
scan is terminated. If the end of the sector is reached
without a match being indicated, the data not found signal
(DNF~ is issued, and the tokenizer awaits a further com-
mand. The actual tokenization process is depicted in
25 blocks 342-368 of the flowchart of FIG. 8. Therefore, if
the write address is 11000, the address decoder 50 i5
enabled as in block 330 to strobe the scan control logic
51 to set the SCANEN latch ~or token address counter 16-
19. The latch is also prDvided to the segment down counter
30 49 and to the dictionary scan down counter 54-5~ so that
those components are enabled. The scan control logic 51
also provides a scan in progress ~SIP) code to the internal
mailbus 114 through multiplexer 45 so that a check by the
,microprocesc;or 100 will show that the tokenizer is scan-
; 35 ning. Thereafter, ths tokenizer operates as described

~5~8~
- -55-
1 below without further input from the control microprocessor
until the scan is terminatedO
In block 342, during the first clock cycle, the scan
control logic 51 provides a high signal to pin 10 o~ AND
gate 67, along with the SIP code, and provides a second
high signal from pin 9 of scan control logic 51 to pin 9
of AND gate 67 for providing a load strobe to pin 11 of
segment down counter 49. This allows loading of the 2-
bit segment code ~rom the segment holding register 48.
lo For a five-character word such as "water", two reads or
two grammar bytes are required to completely access the
language word from the appropriate bank of dictionary
: ROMs 1-10.
In block 344, the token address in counters 16-19,
including the token addresses TKAD00 and TKAD14 and the
enable signals from AND gates 25 and 26 allow access to
the first grammar byte of the ~irst word under comparison
in the given bank of ROMs 1-10. The particular high or
~: low bits o~ each ROM are selected through the multiplexers
ll-lS by the address TKAD00, and the particular bank of
ROMs is selected by the address TRAD14. The actual address
in each ROM is determined by the contents of token address
counter 16-lg. Given the ~ddress in the token address
counter, the first grammar byte of the first word under
comparison i~ provided through multiplexers 11-15 to the
~ bank of comparators 29-33. The first grammar byte is
compared with the first grammar byte in register file 34-
38, as selected by the 2-bit code from segment up counter
47. For example, i~ the trial word is "water", and the
first language word in the particular sector of dictionary
: ROMs 1-10 is "waste", two grammar bytes are required to
. make a complete comparison. The tok~n address counter
will have selected the first grammar byte for the word
"waste", and the segment up counter 47 will have selected
the ~irst grammar byte from the register file 34-38. For

~l~551!37C~
-56-
1 example, the segment up counter may provide a code 00 to
indicate the first four bits of the 4-bit register slice
in each register 34-38 to be output to the comparator
bank 29-33.
In block 346, a determination is made as to whether
or not the dictionary language word grammar byte is iden-
tical to the corresponding trial word grammar byte under
comparison. As discussed above, the tokenizer reset caused
a 1 to be posted at pins 5 and 12 o~ the compare latch 39
to be input to the comparators 29-33. The comparators
are linked by a daisy chain connection ~o that a 1 posted
on the input of comparator 29 is also posted on the input
of comparators 30, 31, 32, and 33. As a result, the output
from pin 6 of comparator 33 also indicates a 1 for input
to pin 2 of latch 39. However, i~ any comparator does
not indicate a match, the output of that particular com-
parator goes to zero, and the output from the subsequent
comparators, including the output from pin 6 of comparator
33, goes to zero. Thereafter, a zero is provided to pin
20 2 of latch 39 and is output ~rom pins 5 and 12 of the
compare latch until the compare latch 39 is reset.
Assuming that there is no match in the first grammar
byte under comparison, a ~ero is posted on pins 5 and 12
of the compare latch 39, indicating a mismatch. See ~lock
348. As a result, the output ~rom pin 9 of compare latch
39 remains low, indicating the lack of "data equals".
The scan limit down counter 54-56 is then tested according
to block 350 to determine whether or not the down counter
has reached zero. As discussed above, the down counter
is initially loaded with beta, the total number of grammar
bytes re~uired to be read from a given sector to read all
; of the words contained in that sector. By way of example,
a sector comprising 5-character words, beginning with "w"
and containing five language words, would be stored in
;35 ~ive groups of two grammar bytes each so that a total of

~l2~i8~
-57-
1 ten grammar bytes would be requirecl to be read in order
to scan the contents o~ the s~ector.
I~ the scan limit down counter had reachecl zero,
indicating that all gra~mar hytes of the words contained
in the given sector had been rlead as in block 352, a signal
is sent ~rom dictionary scan down counter to pin 3 o~
scan termination logic 40, which then provides a low output
from pin 5 to N~ND gate 70. The output o~ N~ND gate 70
is then fed back to pin 12 of scan termination logic 40
as a high signal, which produces a low from pin 8. The
low æignal ~rom pin 8 provides a reset to pin 1 of scan
control logic 51, which causes the scan in progress (SIP3
to go low, The scan enable coda (SCANEN) is also stopped
to stop all the counters in the tokenizer. Simultaneously,
a data not found code ~DNF) is produced at pin 6 of scan
termination logic 40 to be output through pin 1~ of multi-
plexer 45. The countdown of the dictionary scan down
counter to `zero indicates that all dictionary words in
the sector have been scanned and compared. ln the example
where the given sector under consideration contains ~ive
words, five letters ~ach, ten grammar bytes will have
been read over the course of ten clock cycles. After
the tokenizer is stopped and when the data not ~ound signal
is produced, the tokenizer waits ~or a read or write com-
mand,
Where the scan limit down counter has not yet reachedzero in block 350, the tokenizer then tests to see whether
or not the secfment down counter 49 has reached zero, as
indicated in block 354. In the example of "water", the
segment code originally indicated that two grammar bytes
must be reacl ~rom the appropriate bank o~ the dictionary
ROMs 1-10 in order to compare the entire word. Since
only one grammar byte had been read, the segment down
counter shows a representation o~ "1", having been decre-
mented once.

~zss;~9
-58-
1 Where the segment down counter has reached zero, as
indicated in block 365, the scan control logic 51 is given
a signal from the segment down counter 49 to test the
compare latch 39 to determine ~he resulks o~ the compare
of the two words. Since block 346 produced a mismatch,
scan outputs a load strobe from AND gate 67 to reload the
segment down counter 49 ~rom the ~egment holding register
48. The signal to scan control lo~ic 51 also resets the
segment up counter 47 through the output from pin 8 of
scan control logic 51. The same signai resets the compare
latch 39 to post a 1 on pins 5 and 12 o~ latch 39 and
increments the token address counter and decrements the
dictionary scan down counter. This indicates that all
grammar bytes ~or a given word have been read, but that
subsequent words remain to be compared in the comparators
:29~33, since dictionary scan down counter had not reached
zero. The compare process then repeat~, as discussed
above and as described below.
If the segment down counter has not yet reached zero,
the segment up counter 47 is incremented as indicated
in block 360. This provides a signal to the register
files 34-38 to allow parallel access to the appropriate
4-bit registers in each regis~er 34-38. Additionally,
the segment down counter is decremented, and the dictionary
~:25 scan down counter is al50 decrementedO Finally, the token
address counter is increment~d to allow access to the
next succeeding grammar byte in the appropriate bank of
dictionary ROMs 1-10. The compare process is then xepeated
through block 344 for each succeeding grammar byte and
subsequent language words on a grammar hyte-by-grammar
byte basis (since a mismatch was found ~or the word under
comparison).
Returning to block 346, i~ a match is indicated, the
flow continues to block 362. In this case, the daisy
chain of comparators 29~33 have found a match of the re-
:~ ,

~2S~8(~3
-59-
1 spective bits being compared and maintain a 1 at the output
from pin 6 of comparator 33. Therefore, a 1 i~ maintained
on pins 5 and 12 o~ compare lakch 39. For example~ if a
match was not ~ound for "waste", and if the next word in
the dictionary ROMs is "water", a comparison o~ the first
grammar byte for the word "water" ~rom the dictionary ROM
with the first ~rammar byte from register ~iles 34-38
would produce a matchJ Thenl the scan limit down counter
would be tested as in block 362 to dete~mine whether or
not the counter had reached zero.
Assuming for the moment that the scan limit down
counter did not reach zero, block 362 continues to blocX
364. The tokenizer then determines whether or not the
segment down counter has reached zero. This step deter-
;15 mines whether or not all the grammar bytes for the trial
word "waterl' have been read and compared. In the case
where the second grammar byte for "water" has not been
compared, the tokenizer increments the segment up counter
47 to designate a new register file location ko be accessed
and decrements the se~ment down counter with the nex~t clockcycle. The tokeni~er also decrem2nts the dictionary scan
down counter and increments the toXen address counter so
that the next grammar byte in the dictionary ROMs can be
;accessed. These steps are carried out in block 360. The
operation then returns to the input o~ block 344 to conduct
an additional comparison.
Continuing to block 364, and assuming that a match
was found for the word "water" in the comparison of the
second grammar byte, the segment down counter will have
reached zero, since a representation of the segment code
for two grammar bytes in the segment down counter 4g will
have been clecremented to zero. This indicates that all
grammar bytes for the word "water" have ~een read and
compared. A reset signal is then provided to pin 12 o~
scan control logic 51, which provides an output signal at
.. .

-60-
1 pin 8 to pin 11 of compare latch 39. This tests the status
of compare latch 39 and finds a high signal at pin 9,
indicating DEQ. At the same time, a low signal is output
from pin 8 o~ latch 39 to NAND gate 70, which provides a
high signal ~o pin 12 of ~can termination logic 40. This
produces an interrupt ignal 'at pin 8 o~ logic 40 to reset
scan contr~l logic 51. As a result, the scan in progress
(SIP~ goes low, and the SCANEN goes high. This stops all
counters and maintains the token addre~s counter 16-19
with the same address as was present during the last read
for which a match was ~ound between the language word and
the trial word. The tokenizer is then available to acc~pt
additional input such as a read command.
Assuming now that the dec.ision block 362 indicates
that the scan limit down counter has reached zero, and
that a match was indicated for comparators 29-33, block
368 is entered. It should be apparent that, i~ the scan
limit down counter reaches zero, the segment down counter
49 has also reached zero. In that situation, a signal is
sent to ~in 3 of the scan termination logic 40 from pin
13 of the dictionary scan down counter 56. Additionally,
the segment down counter 49 has sent a signal to scan
- control logic 51, which produces an output at pin 8 for
testing the status of compare latch 39. Since the output
at pin 9 of compare latch 39 i6 high, indicating DEQ, an
output is provided from pin 8 to NAND gate 70 which pro-
duces a low pulse to pin 12 of scan termination logic
40. As discussed above, pin 8 provides an interrupt signal
to AND gate 66 and a reset to scan control logic 51 through
;30 pin 1 Qf logic 51. This stops all counters and terminates
the SCANEN and stops the compare. The tokenizer is then
available ~or reading the address of the matched word.
The address may then be used to produce a token using the
software of control microprocessor 100~
3~ In summary, the sequence of operations for loading and
.
.

1 starting the tokenizer have been described. The operation
of the tokenizer, after appxopriate parameters have been
loaded, has also been described. Once the tokenizer pro-
cesses the trial word and the language words ~tored in
dictionary ROMs 1-10, output signals are provided indi-
cating the status o~ the tokenizer with respeat to the
processing conducted by the tokenizer. During that pro-
cess, an address may be derived which can be used by the
control microprocessor to pxoduce a token, as is well
known in the art.
In this regard, it will be assumed that an address
for a match~d word resldes in token address counter 69-
19. In block 372, the control microprocessor provides
a read command to access the address for the purpose o~
deriving a token. Specifically, if the address is xx000
and M~RW is high, use decoder 52 to enable reading o~ the
least significant eight bits of the token address counter
69-19 through token address bu~ 113 and external mailbus
115. Then return to 300 to await a read command for the
remaininy bits of the tokPn address.
Tn block 274, the remaining bits of the token address
for the match trial word is read from the token address
counter 69-19. Specifically, i~ the read address is xx001
and MBRW is high, use the decoder 52 to enable reading of
the most significant seven bits o~ the token address and
return to 300 to await a further command.
Once the control microprocessor has read the 15 bits
of the token address, a token can be derived as discussed
above.
30The next succeeding three blocks 376-380 are used
to effectively translate a token into a human language
word. This process is called detokenization and is dis-
cussed below with respect to Fig. 9. However, assuming
the control microprocessor issues a read command to read
the ASCII or EBCDIC representation o~ a word contained in
.
..
.

;;8Q~
-62-
1 the di,ctionary ROM corresponding to khe address, the lan-
guage word can be output to the control microprocessor on
a grammar byte-by-grammar byte basis. Specifically, in
block 376, an address xxOO10 and a high signal ~or MBRW
causes GBLOCK is low. The clecoder 52 is used to enable
reading of the first eight bits o~ the grammar byte of
the language word contained ir~the appropriate bank of ROMs
1-10 .
I~ the control microprocessor issues an address xxOll
and a high signal ~or MBRW, the decoder 52 is used to
enable reading of the second eight bits o~ the grammar
byte of the language word contained in the given bank o~
ROMs 1-10. Finally, in block 380, an address xxlOO and
a high for MBRW allows decoder 52 to enable reading o~
the last four bits of the grammar byte of the language
word. If the particular word corresponding to the token
requires more than one grammar byte for storage of the
word, the remaining grammar byte or grammar bytes can be
accessed in the same manner.
The final read command the control micxoprocessor
can produce is the address xxlOl. I~ MBRW is high, the
decoder 52 enables reading of the tokenizer status bits
from the pins of multiplexer 45. The tokenizer then rs-
turns to 300 to await a subsequent command and returns
to 300 if the tokenizer ~ails to recognize any of the
commands output by the control microprocessor.
During the process of detokenization of a token to
produce a natural language word, the read and write com-
mands described are used. Considering the flow chart
of Fig. 9, detokenization is initiated by the control
microprocessor first issuing a tokenizer reset command
for resetting the tokenizer as indicated in block 402.
The control microprocessor then utilizes an appropriate
routine to derive a segment code and address from the
particular token for which a natural language word is
,

X5~0~
-63-
1 desired from ROMs 1-10. The segment code and address are
used as data and placed on the data bus 115 when the appro-
priate address code is produced at the address bus.
Specifically, block 406 indicates that the control micro-
processor issues a write command 5sssOOlOO ~or loadingthe low byte o~ the scan add:ress register. When the ad-
dress decoders decode the write command, the low byte
o~ the scan address register is loaded with the least
significant bits o~ the address to be accessed. Block
408 indicates the control microprocessor issues the address
ssssO1000 to load the high byte or the scan address xegi~-
ter. This write command is also decoded in the decoders
52 and 50. The most significant bits of the address are
then loaded in the token address counter 16~
~eading o~ the language word ~rom the appropriate
bank of ROMs 1-10 i8 accomplished on a grammar byte-by-
grammar byte basis wherein representations of each grammar
byte of the language word are transmitted in a ~equence
of eight bits, eight bits, and four bits ovex three clock
: 20 cycles. Specifically, the control microprocessor issues
a MBRW code and an address ssssxxO10 in order to read the
first eight bits of the appropriate grammar byte ~rom
grammar bus 112. The next eight bits and the last four
bits of the particular grammar byte are read sequentially
after the control microprocessor issues the read command
ssssxxOll and ssssxxlOO.
As indicated in block 412, additional grammar bytes
may be required to be read if the segment code derived
by the control microprocessor from the particular token
indicates that more than one grammar byte must be read
to obtain the entire language word. If the segment code
is a representation of "S2" the above steps in blocks
406, ~08 and 410 are repeated once by incrementing the
address by one and reloading the incremented address into
the token address counter. Similar comments apply if the
, . .

~s~
64-
1 segment code is a representation of "S3" or ~S4". The
control microprocessor then waits ~or ~he next read or
writa command.
Using the above described method and apparatus, natu-
S ral language words can be transformed into tokens accordingto addresses of common words contained in a dictionary.
The tokens derived in the tokenization process can be
used to store files in a compressed form ~r to provide
textual txansmission ov~r a substankially decreased amount
of time. When the language word is to be retrieved using
the token, the token is manipulated to derive an address
and a segment code ~or the language word corresponding to
the address. The token may be processed to provide the
language word so that the tokenized text can be displayed
or printed.
A further application o~ the method and means de-
scribed herein is for tokenizing strings o~ words such as
phrases. There~ore, not only can common words be repre-
sented by tokens~ but common phrases including common
words which are already stored in the dictionary ROM can
also be represented by a single token. This application
provides for even greater reduction in required storage
space or transmission time over regular tokeniæation.
The same apparatus as described above can be used for
phrase tokenization. However, the dictionary RO~ 104
contains additional list5 comprising a keytoken list and
a phrase table.
The keytoken list is preferably comprised of 16-bit
tokens~stored as a 20-bit grammar byte in a given bank of
; 30 ROMs 1 5 or 6-10. The last four bits of the grammar byte
are not used. Additionally, phrases of up to four words
in length are the preferred size phrases to be toksnized.
It has been found that the effectiveness decreases with
larger strings of words. It should be noted that 16-bit
tokens are used ~or keytokens, rather than 15-bit tokens,
' ,
' , : '

-65-
1 to take full advantage o~ the 30,000 language words storage
capability of ROMs 1-10. ~n the prototype discussed above,
15-bit tokens wee used, because the full capacity o~ the
address register was not used.
The keytoken list is ess~ntially a 1 x N array of
tokens representing keytokens. In the language sense,
a keytoken is taken to be the skarting word o~ any common
phrase which has been represe~nted by a token in the dic-
tionary ROMs 1-10. The phrase is then represented by a
single 16-bit token in ROMs 1-10. If the common phrase
has four words in it, those four words in the de~ined
sequence are represented by a single 16-bit token. The
same is true of three word phrases and two word phrases.
The keytoken list is arranged accordiny to the al-
phabetic starting letter of the word w~ich is a potential
keytoken and the relative address at which the keytoken
is found for the given ~tarting letter. For example, all
of the possible keytokens having a starting letter o~
"a" are listed horizontally on the first row of Fig. 10.
Each keytoken entry is in the form Qf a 16-bit keytoken
byte storPd in a portion of the dictionary ROMs 1-10.
All the Ka entries in Fig. 10 represent predefined key-
tokens representing words having an alphabetic beginning
letter of "a".
A portion of the dictionary ROMs 1-10 also includes
a phrase table such as the phrase table schematically
; depicted in Fig. 11. This portion of the dictionary ROM
includes a two dimensional array o~ token triplets, Sl-3.
Each triplet has a phrase table address and a corresponding
keytoken address. For example, the keytoken address 0
from Fig. 10 corresponds to the keytoken address 0 in
the phrase table in Fig. 11. Each entry or triplèt com-
prises three tokens, the ~irst token being the token for
the first word after the keytoken word in the phrase,
the second token representing the second word in the phrase

~zs~ 9
-66-
1 a~ter the keytoken word, and the thlrd token representing
the third word in the phrase afker the keytoken word.
Each triplet in a given column under the keytoken address
are tokens representing potential second, third, and fourth
words in a phrase beginning with the word represented by
the keytoken found in the keytoken list of Fig. 10. Each
entry in the phrase table comprises a triplet of tokens
representing at most three words in a phrase. Each triplet
is entered in the phra~e table according to a starting
lo address darived ~rom the keytoken list address of Fig. 10
and the particular phrase table addres~ for the given
triplet. Therefore, for a given starting address in the
phrase table, a search can be conducted through each of
the triplets to determine whether or not the trial text
phrase is included in one of the triplets under the asso-
ciated starting address. If a match is found between
a triplet and the trial phrase, the phrase table address
corresponding to the matching triplet is used by the con-
trol microprocessor to produce a token for the given
phrase. The token is then used by the control micropro
cessor in the same manner as any othex token representing
a single language wordO
Finally, the control microprocessor includes a memory
map in the form of a phrase table memory map such as that
depicted schematically in Fig. 12. The memory map includes
two entries for corresponding to a plurality of keytoken
addresses. The keytoken address is that derived from
the address of Fig. 10~ The phrase table memory map is
used to derive a starting address for the phrase table of
Fig. 11 and the total number of grammar bytes required to
read a particular set of triplets corresponding to the
starting address derived from the memory map. For example,
a keytoken word which is the starting word for a ~our
word phrase and which has a beginning letter of "a" may
match the keytoken Kal. The control microprocessor then
,.- : '

~2SS~9
-67-
1 reads the address, namely 1, and accesses the phrase table
memory map to determine the starting address and the total
number of grammar bytes so that tha trial phrase, in the
form o~ tokens, can be compared to the stored tokens for
the given starting address. For example, the address
derived from the keytoken list o~ Fig. 10 corresponds to
a starting address for the phrase triplets of 101 and
includes a total of lS gram~ar bytes. This information
is used to load the token addres counter 16-19 with the
lo starting address and to load the dlctionary scan down
counter 54-56 with a representation of the total number
of grammar bytes, 15. The segment holding register 48 is
loaded with a representation of 3 since the triplets in
the phrase table in the dictionary ROM comprises three
tokens.
Considering the ~low chart of Fig. 13 in conjunction
with the tables of Figs. 10-12, the process of phrase
tokenization will now be described. It will be assumed
that a block o~ text is available in the control micro-
processor ~or tokenization. For each word in the text, atoken is derived as discussed above with respect to Fig.
8. However, the dictionary ROMs now contain a keytoken
list and a phrase table in addition to the stored library
of language words. Therefore, phrase tokenization includes
the additional step o~ determining whether or not a token
can be ~ubstituted for an entire phrase of up to four
language words. As indicated in Fig. 13, a language word
is retrieved from the text contained in the control pro-
cessor and tokenized using the tokenizer 102a. During
block 504, the token is then used as a potential keytoken,
~amely the ~irst word in a potential phrase to be substi-
tuted by a token. The control microprocessor accesses a
memory map for keytokens, such as that depicted schemat-
ically in Fig. 13, using the starting lekter of the origi-
nal text wo:rd to determine a starting address and a key-

~ls~
-68-
1 token segment length for searching in a keytoken list.
The start address is the address in dickionary ROMs 1-10
at which the search for the keytoken i~ to begin. As
indicated above, the keytoken list contain~ the first
word of all phrases represented by tokens in the dictionary
ROMs. The keytoken segment l~ength is the number o~ gram-
mar bytes required to be ~earched and compared for all
- keytokens of a given skarting character. This step i~
similar to tokenization as discussed above.
During block 506, a search is conducted of the key
token list in ROMs 1-10 for a keytoken match using the
start address obtained from the memory map and the keytoken
segment length. Specifically, the start address is loaded
into token address counter 16-i9 with an appropriate write
1~ address code. This will allow acces~ to the keytoken
li~t in dictionary ROMs 1-10. The dictionary scan down
counter 54-56 is loaded with a representation of the key-
token segment length to serve the same ~unction as the
term beta used in conjunction with tokenization~ The
dictionary scan down counter is loaded using an appropriate
address code to the mailbus address line 116. The segment
holding register 48 is loaded with a representation of
; 1 since all keytokens ar~ one grammar byte in length.
The segment holding register is loaded using an appropriate
address to address buffer 58. Finally, the keytoken is
loaded into register file 34-38 with an appropriate address
write command in a manner similar to that discussed above
with respect to tokenization. A start scan command is
then provided to the mailbus addre~ line and the tokenizer
then begins searching through the keytoken list for a
match.
As indicated in block 508, a match is indicated in
the same manner as discussed above with respect to the
description in conjunction with Fig. ~ for tokenization.
If a data not found command is issued, the control micro-

~25~8~
-6~-
1 processor returns to retrieve the next subsequent language
word for repeating the above-described process. If a
match is found, the word from the text is a keytoXen and
is th~ first word in a potential phrase in the text.
Therefore, it must be determ~Lned whether or not the word
in the text is ~ollowed by at: most three words which are
identical to a phrase in the phrase table. To this end,
the control ~licroprocessor reads the address from the
keytoken list at which a match was found between the trial
token and the kaytoken.
As indicated in block 510, the keytoken address i5
used to search the phrase table memory map ~or the start
address of the potential phrase and the segment length
in the phrase table for those potential phrases having
the keytoken as the beginning word.
The phrase table memory map is depicted schematically
in Fig. 12. Given the keytoken address from the keytoken
list, for example address 1, the ~tartiny address of the
three potential words or triplets in the phrase table and
the total number of grammar bytes needed to search through
all the triplets having the keytoken as the first word
can be found. For example, where the keytoken address is
1, the start address of the phrase triplets in the phrase
table i5 101. The total number of grammar bytes needed
to access all the token triplets corresponding to the
keytoken is 15. It should be noted that the triplets to
be searched in the phrase table are tokens rather than
language words. The tokens were defined during ereation
of the dictionary ROM in a manner similar to the creation
of the dictionary library discussed above with respect to
Figs. 1-9.
During block 512, the control microprocessor tokenizes
the three words coming i~mediately after the trial keytoken
obtained ~rom the text in block 502. The three tokens
are then loaded in~o the register file 34-38 to b~ compared

~25~i!9
-70-
1 to the respective tokens in the triplets in the phrase
table. The first word of the potential phrase need not
be compared with any word in the phrase table since the
trial keytoken was already found to match a keytoken in
the keytoken list.
In block 514, the start address derived ~rom the
phrase table memory map, 101, and the segment length,
15, are loaded in the tokenizer. Specifically, a repre-
sentation of 101 is loaded into the token address counter
16-19 with an appropriate write address command. A repre-
sentation of 15 is loaded into the dictionary scan down
counter 54-56 with an appropriate write address command.
The seyment holding register 48 is loaded with a 2 bit
code representing a segment code S3 using an appropriate
address command. The segment code is S3 since there are
three tokens or thr0e grammar bytes t4 b~ compared in
comparators 29-33. It should be noted that the three
trial tokens representing the last three words in the
potential phrase are loaded into the register ~ile 34-38
in a manner similar to that described with respect to
tokenization in Fig. 8. This also allows for parallel
- processin~ resulting in shorter scan times. A start scan
command is then issued by the control microprocessor to
begin comparing the three tokens in the register file
34-38 with the token triplets in the phrase table starting
at starting addres.s 101. Specifically, the first token
in the register file is compared with the token represented
by Sl, the second token is compared with the token repre-
sented by S2 and the third token is compared with the token
represented by S3. The steps involved are similar to
those described with resp~ct to Fig. 8. If a match is
found so that compare latch 39 provides a data equals
code, DEQ, the tokenizer is stopped and the address in
the phrase table is used as the token for the phrase.
The control microprocessor reads the address in a manner

~Z5~
-71-
1 similar to that descrlbed above with respect to tokeniza-
tion. The control microprocessor then continues with the
next word after the phrase which was just tokenized.
A~ indicated in block 518, if a match is not found
betwsen the first trial token and Sl, the second trial
token and S2 or the third trial token and S3, the control
microprocessor computes a m~smory variable to dete~mine
whether or not the tokanizer compared thr~e trial tokens
with three tokens in the phrase table. I~ a comparison
was made of three tokens, and a match was not ~ound, block
520 is entered and the control microprocessor issues a
write command to the tokenizer 102a to null the bits repre-
senting the last token in the trial tokens being compared.
This is done by placing null characters in the last grammar
byte loaded into register file 34-38. In the present
example, where three trial tokens were being compared, two
trial tokens containing usable information ~or comparing
remain for comparison with the tokens represented by Sl
and S2. However, the control microprocessor still loads
the sagment holding register with a representation of S3.
This allows comparison by the tokenizer of all three tokens
represented by Sl, S2, and S3. The control microprocessor
then issues an address write command to load the dictionary
scan down counter with the starting address o~ the phrase
corresponding to the keytoken address 1, namely 101, and
the total number of grammar bytes, 5.
The search is conducted again throllgh the phrase
table in th dictionary ROM 1-10 using the representa-
tion of 101 as the starting address. The tokeniz~r com-
pares only the first two tokens in the register fil~ 34-~8
with the respective grammar bytes represented by Sl and
S2. The tokenizer determines whether or not th~re is a
match according to block 51~ as described abovP. If a
match is found, the phrase table address is read by the
control micropxocessor to be used in creating a token.
.' ' :
:` :

~o~
-72-
1 It should be noted in this case that the address and there-
fore the resulting token represents only thr~e language
words; namely the keytoken and the language words repre-
sented by Sl and S2. It should be noted even though the
phrase in the phrase table for which a match was ~ound in
this case only contains two tokens corresponding to Sl
and S2, the token corresponding ~o S3 contains only null
characters. As ~ result, the con~rol microprocessor re-
places only three words from the text, namely the keytoken
and the next two words, with the token derived from the
phrase table. Tokenization o~ words and o~ phrases then
continues as usual.
Returning to block 516, if a match is not ~ound/
the control microprocessor senses the data not found code
(DNF). The control microprocessor then tests its memory
variable to determine whether or not the memory variable
is greater than 1. I~ the memory variable is greater
then 1, this indicates that in addition to the keytoken,
there is at least one token in the phrase table the can
still combine with the keytoken to constitute a phrase.
The process is then repeated until a match is not found,
and the bits storing the token corresponding to S2 and S3
hav~ already been loaded with null characters. In such a
case, ths text word for which a keytoken was found will
Z5 merely be represented by the 16-bit token.
In the process described above, an entire tPxtual
block can be represented by tokens requiring significantly
less storage space or processing time than the ASCII or
EBCDIC equivalentO Many of the phrases in the textual
passage will be capable of being represented by a single
token, the remaining words being represented by their own
tokens.
Signi~icantly, the tokenizer can conduct searches or
scans very effic ently bec~use a complete operation, in
terms of a comparison with a match or rejection, can be

5i~V~
-73-
1 carried out in at least one system clock cycle and in, at
most, four clock cycles. This ~s in contrast to other
processors which require many clock cycles to complete a
given task. Additionally, the disclosed embodiments pro-
vide for sufficient flexibility to allow less than ~ourgrammar bytes to be processed in a given search, if the
trial word contains less than thirteen characters.
The disclosed method and apparatus also enhances
searching ef~iciency by limiting the search or scan range
only to that small portion of the dictionary ROM where
the word will appear, if at all. The scan limit down
counter stops the search and compare process after there
is no chance of finding a match. Also, all grammar bytes
of a word are compared, even though a match is no longer
possible, in order to properly increment the address coun-
ter for the next compare cycle.
It is also significant that parallel processing i6
used to simultaneously access language words in the dic-
; tionary ROM to compare the trial and dictionary words and to read and write data. This enhances the ef~iciency.
:;
,

~ll2S'5~1
--74--
TABLE III
A 02 - OOOS); OFFF
A 03 - OOFO; OOAO
A 04 - O.FFF; OO~SA
5 ~ .
A16 -
B 02 -
B 03 -
B 16 -
1~ . .
Z 16 - ODDD; OOFO
.
~ 35
.
~. ' , ,
: ' , ': '
~ ..

8~
The following is a listinq OI a basic program ~or
causing the symbolic tokenizer to check the dlctionary
ROM 1 through 10 for a word typed into the control micro-
processor 100 through its ke!yboard. If the word is not
5 found in the dictionary ROM 1 through 10, then the word
is highlighted after the rei:urn key i8 struck. If the
word is found, then the word is not highlightPd.
.
..... . .
'~ - .
. ' . , :
.

~;~5~
1 TABLE IV
1 PRINT CHR$(14); CHR$(5); CHR$(147)
2 POKE 53280,0: POXE 53281,0
IF FL=l THEN 25
FL=l: LOAD "SPELL.HC",8,1
GOSUB 1000
A$=""
B$=""
A=l: G-0
10 60 G=G~l
GET A$: IF A$="" THE:N 65
IF A$=CHR~(13) OR A$=CHR$(141) THEN 65
PRINT A$;: B$=B$+A$: GOSUB 1000
77 K=AS~(A$): IF K~127 THEN K-K-128
78 IF K=32 OR K=45 THEN 240
79 IE K ~65 OR K>90 THEN G=G-l:GOTO 100
POKE t52991+G), ASC(A$)
100 A$=''ll
110 A=A+l
120 GOTO 60
130 POKE 529gl,&
140 SYS 4918~
150 IF PEEK (52990)=1 THEN 33
160 PRINT CHR$(146);
170 GOSUB 2000
180 PRINT CHR$(18); B$; CHR$(146); : A$=""
190 GET A$: IF A$=~lll THEN 190
200 IF ASC~A$) <> 133 AND ASC (A$) ~> 137,
THEN A=l: B$="": G=l: GOTO 70
210 GOSUB 2000
:~ 220 GOSUB 1000
230 GOTO 30
: 240 IF A=2 THEN 30
250 GOTO 130
1000 PRINT CHR$(18)1' " CHR$(157~; CHR$(146);:RETURN
2000 FOR C=l TO A: PRINT CHR$~157);: NEXT C: REl'URN
~`

~25~
-77-
1 The following is a Commodore 6510 Assembly Language
listing o~ the code required to cause the control micro-
processor 100 (a Commodore 64X) to access the ~ymbolic
tokenizer when the above Basic program is run. The Commo-
dore 64K computer uses a 6S10 central processing unit
which has a similar assembly language instruction set to
the more popular 6502 central processing unik.
'
~ .

~ls~
--78-
1 TABLE V
MAP .WORD $C800
.WORD $C868
.WORD $C8Do
.WORD $C938
.WORD $C9Ao
.WORD $CAo8
.WORD $CA70
.WORD $CAD8
.WORD $CB40
.WORD $CBA8
.WORD $CC10
.WORD $CC78
.WORD $CCE0
.WORD $CD48
.WORD $DDBo
RFLG=52990
CCNT=RFLG+l
CSTK=CCNT+l
MEM=CSTK+40
LENGTH--MEM~l
SEGC=LENGTH+l
OUTSTK=SEGC+l
CNT5=OUTSTK+3
CNT8=CNT5+1
XSAVE=CNT8+1
ZFLAG=XSAVE~l
SHIFTC=ZFLAG+l
YCNT=SHIFTC~l
FLET=YCNT+l
ZPAGE=$FB
TKRST=56832
TKSAL=56836
TKS~=56840
3S TKSCL=56844
TKSCH=56848

~9L25S~
-79-
1 TABLE V (Contd.)
TKSEG=56852
TKRUN=56856
TKSTAT-56837
CLD :CLEAR DECIMAL
LDX CCNT ;GET CHAR~CTER COUNT
DEX ;LAST CHARACTER
LDA CSTK,X :GET CHARACTER
AND #$7F :FORCE LOWER CASE
STA MEM STUFF IT
SEC ;SET CARRY HI
SBC ~$5A ;SUBTRACT TOP OF RANGE
BCC NCHR :BR~NCH IF NOT CHARACTER
LDA #$40 ;GET BOTTOM OF RANGE
SBC MEM :GET CHARACTER
BCS NCHR~l ;BRANCH IF CHARACTER
NCHR DEX :LAST NOT LETTER
STX LENGTH ;LAST CHARACTER ADDRESS
; LDX #0 ;INDEX TO 0
20 RFMTLP LD~ CSTK,X ;GET FIRST LETTE~
EOR #39 ;TEST FOR APOSTROPHE
: BNE LETTER ;BRANCH IF ~OT
LDA #26 ;GET APOSTROPHE
STA CSTK,X :STUFF IT
JMP CET ;GET NEXT LETTER
LETTER SEC ;SET CARRY HI
LDA C5TK,X ;GET A LETTER
AND #$7F :MAKE IT LOWER CASE
SBC #64 ;CREATE 5 BIT C3DE
STA CSTK,X ;PU~ IT BACK
CET CPX LENGTH ;TEST FOR LAST LETTER
BEQ PACK :BRANCH IF RF~T DONE
INX ;SELECT NEXT LETTER
~ JMP RFMTLP ;GO REFO~MAT TO 5 BITS
35 PACK LDA CSTK ;GET FIRST CHARACTER
STA FLE~ :SAVE IN FIRST LETTER CELL

~s~
-80-
1 TABLE V tContd.)
LDA #O ;GET O
STA SEGC ;GET SEGMENT TO 1
STA TKRST :RESET TOKENIZER
STX XSAVE ;SAVE X
OUTCLR LDA #0 7GET O
STA CNT5 ;RESET SOURCE BIT COUNTER
STA CNT8 ;RESET DESTINATION BIT
COUNTER
lo STA ZFLAG :RESET ZERO LETTER FLAG
STA SHIFTC :RESET SHIFT COUNTER
STA OUTSTK ;CLEAR TRIAL BYTE A
STA OUTSTK+l ;CLEAR TRIAL BYTE B
STA OUTSTK~2 ;CLEAR TRIAL BYTE C
15 STA YCNT ;CLEAR PSEUDO Y
SLOOP LDX XSAVE ;GET X
LDA ZFLA5 ;GET ZERO ELAG
C~P #l ;SEE IF SET
CLC ;CLEAR CARRY
BEQ ZERO ;BRANCH IF SET
ROR CSTK,X ;SHIFT SOURCE TO CA~RY
STX XSA~E ;SAVE X
ZERO LDX YCNT :GET PSEUDO Y
ROR OUTSTK,X :SHIFT INTO TRIAL BYTE
LDX XSAVE , GET X BACK
INC SHIFTC ;INCREMENT BIT COUNT
LDA CNT5 ,GET SOURCE BIT CVUNT
CMP #4 ;TEST FOR BYTE EXHAUSTION
BEQ RST5 ;BRANCH IF EXHAUSTED
INC CNT5 ;INCREMENT SOURCE BIT COUNT
JMP ~TST ;GO CHECK TOTAL BITS
RST5 LDA #O ;GET ZERO
STA CNT5 ;CLEAR SOURCE BIT COUNT
CPX #O ~ ;SEE IF hAST LETTER OUT
BNE NXTLET ;BRANCH IF NOT
LDA #I ;GET 1

~.2sis~
-81-
1 TABLE V ~Contd.)
STA ZFLAG ;SET ZERO LETTERS FLAG
JMP BTST ;GO CHECK TOTAL BITS
NXTLET DEC XSAVE :SELECT NEXT LETTER
BTST LDA SHIFTC ;GET SHIFT COUNT
CMP #20 :TEST FOR 20 BITS SHIFTED
BEQ GBOUT ;OUTPUT THE GRAMMAR BYTE
LDA CNT8 ;GET DESTINATION BIT COUNT
CMP #7 ;TEST FOR BYTE EXHAUSTION
BEQ NXT8 ;BRANCH TO GET NEXT BYTE
INC CNT8 ;INCREMENT DESTINATION BIT
COUNT
~MP SLOOP ;GO SHIFT MORE BITS
NXT8 LDA #O ;GET ZERO
STA CNT8 ;CLEAR THE DESTINATION COUNT
INC YCNT ;INCREMENT PSEUDO Y
JMP SLOOP ;GO SHIFT MORE BITS
GBOUT ~LC ;CLEAR CARRY
STX XSAVE ;SAVE X
LDX YCNT ;GET PSEUDO X
ROR OU~STK,X ;MOVE LAST 4 BITS DOWN
ROR OUTSTK,X
ROR OUTSTK,X
ROR OUTSTK,X
LDY #O :GET O
LDA SEGC ;GET SEGMENT COUN~
CMP #O :CHECK FOR O
BEQ OFFO ;BRANCH IF SEGMENT 1
CMP #l ;rHEcK FOR 1
BEQ OFFl ;BRANCH IF SEGMENT 2
CMP #2 ;C~ECX FOR 2
BEQ OFF2 ;BRANCH IF SEGMENT 3
LDX #13 ;~ET S4 OFFSET
JMP DOIT
35 OFFO LDX #l ;GET Sl OFFSET
JMP DOIT
. .

~;5`~
-82-
1 TABLE V (Contd.~
OFFl LDX #5 ;GET S2 OFFSET
JMP DOIT
OFF2 LDX #9 ;GET S3 OFFSET
DOIT JSR DOUT ;OUTPUT TRIAL BYTE A
JSR DOUT ;OUTPUT TRIAL BYTE B
JSR DOUT ;OUTPUT TRIAL BYTE C
JMP NXGB :SEE IF MORE GRAMMAR BYTES
DOUT LDA OUTSTR,Y SGET BYTE FROM STACK
STA TKRST,X ;OUTPUT TO TOKENIZER
INX NEXT TRIAL ADDRESS
INY ;NEXT STACK ADDRESS
RTS :RETURN
NXGB LDX XS~VE ;RESTORE X
CPX #O ;SEE IF DONE
~EQ SEGLD ;BRANCH IE WORD LOADED
INC SEGC ;SELECT NEXT SEGMENT
LDA SEGC
CMP #4
BEQ ABORT ;BRANCH IF MORE THAN 16
LETTERS
DEC XSAVE ; SELECT NEXT LETTER
JMP OUTCLR ;GO LOAD NEXT SEGMENT
ABORT LDA #l ;OE T OK CODE
STA RF$G : PASS IT TO BASIC
RTS ;RETURN TO BASIC
~ SEGLD LDA SEGC ;GET FINAL SEGMENT
- STA TKSEG ;PUT IN TOKENIZER
DEC LENGTH ;OFFSET WORD LENGTH
LDA LENGTH ;GET WORD LENGTH
ASL A ;MULTIPLY BY 2
TAX ;PUT I~ X
LDA MAP,X ;READ THE MAP LO BYTE
STA ZPAGE ,SAVE IT
IN:K ;ADVANCE TO HI BYTE
LDA MAP,X :READ THE MAP HI BYTE
'

~s~
-83-
1 TABLE V (Contd.)
STA ZPAGE~l ;SAVE IT
DEC FLET ;BIAS THE LETTER CODE
LDA FLET ;GET THE LETTER OF THE WORD
ASL A ;MULTIPLY BY 4
ASL A
TAY ;USE AS SUB-MAP INDEX
: LDA (ZPAGE),Y ;GET START ADR LO
STA TKSAL ;OUTPUT TO TOKENIZER
INY ;ADVANCE ~NDEX
LDA (ZPAGE),Y ;GET START ADR HI
STA TKSAH ;OUTPUT TO TORENIZER
INY ;ADVANCE INDEX
LDA (ZPAGE),Y ;GET SCAN LIMIT LO
STA TKSCL ,OUTPUT TO TOKENIZER
INY ;ADVANCE INDEX
LDA (~PAGE),Y ;GET 5C~N LIMIT HI
STA TKSCH ;OUTPUT TO TOK~NIZER
LDA #0 ;GET 0
STA TKRUN ;START THE SCAN
TX~AIT LDA TKSTAT ;GET TOKENIZER STATUS
~ND #$30 ;MASK OFF UNWANTED BITS
CMP #0 ;CHECK FOR BUSY
BEQ TKWAIT ;WAIT FOR DONE
CNP #$20 :TEST FOR DATA NOT FOUND
B~Q BAD ;BRANC~ IF NOT FOUND
LDA #l ;GET FOUND CODE
DONE STA RFLG ;PASS IT TO BASIC
` RTS ;RETURN TO BASIC
; 30 BAD LDA #0 ;GET DATA NOT FOUND CODE
J~P DONE ;GO PASS IT ON
; .LXB MAP,SR
,
,

~2S~
-84-
The ~ollowing is a sample page from the memory map of
the clictionary ROM 1 through 10 as stored in the control
microprocessor 100 (a CommodoreTM 64K). This memory map
is a portion of the AssembLy Language program listed
above. It tells the control microprocessor 100 where the
beginning adclress for each sector is and how many words
are in each sector. This information is used to load the
address counter 105 (in FIG. 1) and the scan limit down
counter 108 (in FIG. 1) as discussed earlier.
The map entries are listed in groups of two. The
first entry is the starting address for each sector, and
the second entry is the number of grammar bytes in that
sector.
For example, in .WORD $FFFF below, FFFF is the
starting address of a sector. It is the sector containing
words starting with "A" and being two characters in
length. In .WORD 13, which follows .WORD $FFFF, 13 is the
number of grammar bytes in that sector. In .WORD 11, 11
is the starting address of another sector. This sector
contains words starting with "B" and which are also two
characters in length.
.
.

-85-
TABLE' VI
*=$C800
.WORD $FFFF
.WORD 13
.WORD 11
.WORD 4
.WORD 14
.WORD 3
.WORD 16
.WORD 4
.WORD 19
.WORD 7
.WORD 25
.WORD 2
.WORD 26
.WORD 2
.WORD 27
.WORD 5
.WORD 31
`:

86-
1 The following is ~ list of the designated components
of the symbolic tokenizer:
:```
:;

~2~
-87-
1 TABLE VII
1-10 Dictionary ROM FIG. 2
11-15 Data Multiplexers "
16-19 Token Address Counter
20-22 Digital Inverters "
23 Non-lnverting Buffer
24 Deleted "
25-26 Dictionary Enable AND Gates
27 DELETED l~
10 28 GBLOCK Inverse AND Gate FIG. 3
29-33 Comparators
34-38 Register File
39 Inter Grammar Byte Compare Latch
40 Scan Termination Logic
15 41~45 Tokeniæer Output Multiplexer "
46 Bi-Directional Data Bus Transceivex "
47 Segment Up Counter FIG. 4
48 Segment Holding Register 2-bit register
~- Dual Flip~Flop
20 49 Segment Down Counter 1.
Address Decoder
51 Scan Control Logic "
52 Address Decoder
53 Address Select
25 54-56 Dictionary Scan Down Counter "
57 Dip Switch "
58 Address Bu~fer
59 Register File Select 1'
Inverter "
30 61 Inverter '~
62 Inverter "
63 Inverter "
- 64 Inve:rter
~: 65 AND Gate
3s ~ND Gate "

~zs~
-sa-
1 TABLE VII (Contd.)
67 AND Gate
68 Inverter FIG. 3
69 Inverter "
N~ND Gate "
100 Control Microprocessor FIG. 1
101 Interface ll
102 Mailbus "
103 Compare Logic "
l0 104 Dictionary ROM "
105 Address Counter
106 Segment Register "
107 Segment Counter
108 Scan Limit Down Counter
109 Control Logic
110 Status Register
111 Trial Word Register
112 Grammar Bus "
113 Token Address Bus
114 Internal Mailbus
115 External Mailbus
116 Mailbus Address Lines "
118 Lines Between Figs 3 and 4 FIGS. 3 ~ 4
120 " ll l - "
: 25 122 .. "
12~ " "
126
128 " " 1~
201 Commodore 64K Connector FIG. 5
30 202 Bus Transceiver Enable NAND Gate
203 Mailbus I/O N~ND Gate ~J
204 Bidirectional Bus Transceiver
205-206 Buffers

Representative Drawing

Sorry, the representative drawing for patent document number 1255809 was not found.

Administrative Status

2024-08-01:As part of the Next Generation Patents (NGP) transition, the Canadian Patents Database (CPD) now contains a more detailed Event History, which replicates the Event Log of our new back-office solution.

Please note that "Inactive:" events refers to events no longer in use in our new back-office solution.

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer , as well as the definitions for Patent , Event History , Maintenance Fee  and Payment History  should be consulted.

Event History

Description Date
Inactive: Expired (old Act Patent) latest possible expiry date 2006-06-13
Grant by Issuance 1989-06-13

Abandonment History

There is no abandonment history.

Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
DATRAN CORPORATION
Past Owners on Record
DALE E. WINTHER
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents

To view selected files, please enter reCAPTCHA code :



To view images, click a link in the Document Description column. To download the documents, select one or more checkboxes in the first column and then click the "Download Selected in PDF format (Zip Archive)" or the "Download Selected as Single PDF" button.

List of published and non-published patent-specific documents on the CPD .

If you have any difficulty accessing content, you can call the Client Service Centre at 1-866-997-1936 or send them an e-mail at CIPO Client Service Centre.


Document
Description 
Date
(yyyy-mm-dd) 
Number of pages   Size of Image (KB) 
Drawings 1993-10-06 21 899
Claims 1993-10-06 15 567
Abstract 1993-10-06 1 14
Cover Page 1993-10-06 1 15
Descriptions 1993-10-06 89 3,672