Patent 2458199 Summary

(12) Patent Application: (11) CA 2458199
(54) English Title: METHOD FOR THE TRANSLATION OF PROGRAMS FOR RECONFIGURABLE ARCHITECTURES
(54) French Title: PROCEDE PERMETTANT LA CONVERSION DE PROGRAMMES DESTINES A DES ARCHITECTURES RECONFIGURABLES
Status: Deemed Abandoned and Beyond the Period of Reinstatement - Pending Response to Notice of Disregarded Communication
Bibliographic Data
(51) International Patent Classification (IPC):
  • G06F 7/00 (2006.01)
  • G06F 9/30 (2018.01)
(72) Inventors :
  • VORBACH, MARTIN (Germany)
  • MAY, FRANK (Germany)
  • NUECKEL, ARMIN (Germany)
(73) Owners :
  • PACT XPP TECHNOLOGIES AG
(71) Applicants :
  • PACT XPP TECHNOLOGIES AG (Germany)
(74) Agent: SMART & BIGGAR LP
(74) Associate agent:
(45) Issued:
(86) PCT Filing Date: 2002-08-16
(87) Open to Public Inspection: 2003-02-27
Availability of licence: N/A
Dedicated to the Public: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): Yes
(86) PCT Filing Number: PCT/EP2002/010065
(87) International Publication Number: WO 2003017095
(85) National Entry: 2004-02-13

(30) Application Priority Data:
Application No. Country/Territory Date
09/967,847 (United States of America) 2001-09-28
101 39 170.6 (Germany) 2001-08-16
101 42 903.7 (Germany) 2001-09-03
101 44 732.9 (Germany) 2001-09-11
101 45 792.8 (Germany) 2001-09-17
101 54 260.7 (Germany) 2001-11-05
102 07 225.6 (Germany) 2002-02-21
PCT/EP02/02398 (European Patent Office (EPO)) 2002-03-05
PCT/EP02/09131 (European Patent Office (EPO)) 2002-08-15

Abstracts

English Abstract


The invention relates to data processing with multidimensional fields and high-level language codes which can be used advantageously therefor.


French Abstract

La présente invention concerne le traitement de données avec des champs multidimensionnels, et les codes de langage de programmation évolué qu'il est avantageux d'utiliser à cet effet.

Claims

Note: Claims are shown in the official language in which they were submitted.


What Is Claimed Is:
1. A method for translating high-level languages onto reconfigurable architectures, wherein a finite automaton is constructed for the computation in such a way that a complex combinatory network of individual functions is formed and memories are assigned to the network for storage of the operands and the results.
2. A method for data manipulation and/or data processing having a multidimensional field using reconfigurable ALUs, wherein a high-level language code is provided and translated in such a way as to construct a finite automaton for the computation, a complex combinatory network being formed from individual functions and memories being assigned to the network for storage of the operands and/or results.
3. The method as recited in one of the preceding claims, wherein the complex combinatory network is constructed and/or broken down in such a way that the PAE matrix is operated for the longest possible period of time without reconfiguration.
4. The method as recited in the preceding claim, wherein complex instructions are determined to construct the complex combinatory network and/or break it down in such a way that the PAE matrix is operated for the longest possible period of time without reconfiguration.
5. The method as recited in one of the preceding claims, wherein a finite automaton is constructed directly from imperative source text.
6. The method as recited in one of the preceding claims, wherein a finite automaton is constructed from operations adapted to coarse-grained logic circuits and/or existing fine-grained elements (FPGA cells in the VPU, state machines, etc.), in particular being adapted exclusively to such elements.
7. The method as recited in one of the preceding claims, wherein a finite automaton is then broken down into configurations.
8. The method as recited in one of the preceding claims, wherein generated configurations are mapped successively onto the PAE matrix and operating data and/or states that are to be transmitted among the configurations are stored in memories.
9. The method as recited in the preceding claim, wherein the memory is provided and/or determined by the compiler.
10. The method as recited in the preceding claim, wherein during a configuration, data from a VPU-external source and/or an internal memory is processed and written to an external source and/or an internal memory.
11. The method as recited in one of the preceding claims, wherein a memory is provided for an entire data record which is more extensive than a single data word.
12. The method as recited in one of the preceding claims, wherein data is stored in the memories as determined by the compiler during the processing of a running configuration.
13. The method as recited in one of the preceding claims, wherein a memory for operands, a memory for results and a network of assignments and/or comparison instructions, i.e., conditions such as IF, CASE, loop (WHILE, FOR, REPEAT) and optional address generator(s) are provided for triggering the memories with the automaton.
14. The method as recited in one of the preceding claims, wherein states are assigned to memories as necessary, differentiating here between algorithmically relevant and irrelevant states, in particular such relevant states which are necessary within the algorithm to describe its correct function and such irrelevant states which occur due to the hardware used and/or the selected mapping or for other secondary reasons.
15. The method as recited in one of the preceding claims, wherein load/store operations are provided, including external addressing, i.e., data transfers to external modules, and/or internal addressing, i.e., data transfers between PAEs, in particular between RAM PAEs and ALU PAEs.
16. The method as recited in one of the preceding claims, wherein a first configuration is removed during data processing and data to be saved remains in the corresponding memories (REG) (memories, registers, counters, etc.).
17. The method as recited in one of the preceding claims, wherein the first configuration is reloaded and the previously saved data assigned to it is accessed.
18. The method as recited in one of the preceding claims, wherein a second configuration is loaded for access to previously saved data, this second configuration connecting the REGs in a suitable manner and in a defined order to one or more global memories, in particular to access the global memory (memories) using address generators, the address generator generating the address for the global memory (memories), preferably in such a way that the memory areas described (PUSH AREA) of the first configuration removed may be assigned unambiguously.
19. The method as recited in one of the preceding claims, wherein a transformation for representing the parallelizability and/or vectorizability (ParVec transformation) is performed automatically and/or VEC and PAR components are designed as a Petri network to control further processing after complete processing of the particular contents as preferred.
20. The method as recited in one of the preceding claims, wherein arithmetic/logic instructions are mapped directly into the combinatory network, and/or jumps (jump/call) are either rolled directly into the combinatory network and/or are implemented by reconfiguration, and/or conditions and control flow instructions (if, etc.) are either completely resolved and/or processed in the combinatory network and/or relayed to a higher-level configuration unit which then performs a reconfiguration according to the resulting status, and/or load/store operations are mapped into separate configurations and/or are implemented by address generators which write internal memories (REG{}) to external memories via address generators and/or load them from external memories and/or peripherals, and/or register move operations are implemented in the combinatory network by buses between the internal memories (REG{}), and/or push/pop operations are implemented by separate configurations, writing certain internal registers in the combinatory network and/or writing the internal memories (REG{}) via address generators to external memories or reading from external memories, and which are preferably executed before or after the actual data processing configurations.

Description

Note: Descriptions are shown in the official language in which they were submitted.


Method for Translating Programs for Reconfigurable Architectures

1. Introduction

The present invention relates to that which is claimed in the definition of the species. The present invention thus relates to the question of how reconfigurable architectures may be optimally used and in particular how instructions in a given high-level language may be optimally executed in reconfigurable architectures.

To implement execution of instructions for handling data (programs) that are written in high-level languages in a particular architecture used for data handling, there are known compilers which translate the instructions of the high-level language into instructions that are better adapted to the architecture used. Compilers which support highly parallel architectures in particular are thus parallelizing compilers.

Parallelizing compilers according to the related art normally use special constructs such as semaphores and/or other methods for synchronization. Technology-specific methods are typically used. Known methods are not suitable for combining functionally specified architectures with the particular dynamic response and imperatively specified algorithms. The methods used therefore yield satisfactory results only in special cases.
Compilers for reconfigurable architectures, in particular for reconfigurable processors, generally use macros created specifically for the intended reconfigurable hardware, mostly using hardware description languages such as Verilog, VHDL, or SystemC to create the macros. These macros are then called up from the program flow (instantiated) by an ordinary high-level language (e.g., C, C++).

There are known compilers for parallel computers which map program parts onto multiple processors on a coarse-grained structure, mostly based on complete functions or threads.

In addition, there are known vectorizing compilers which convert extensive linear data processing, such as computations of large expressions, into a vectorized form and thus permit computation on superscalar processors and vector processors (e.g., Pentium, Cray).

A method for automatic mapping of functionally or imperatively formulated computation procedures onto different target technologies is described here, in particular onto ASICs, reconfigurable modules (FPGAs, DPGAs, VPUs, Chess Array, KressArray, Chameleon, etc.; hereinafter combined under the term VPU), sequential processors (CISC-/RISC-CPUs, DSPs, etc.; hereinafter summarized by the term CPU) and parallel computer systems (SMP, MMP, etc.). In this connection, reference is made in particular to the following patents and patent applications by the present applicant: P 44 16 881.0-53, DE 197 81 412.3, DE 197 81 483.2, DE 196 54 846.2-53, DE 196 54 593.5-53, DE 197 04 044.6-53, DE 198 80 129.7, DE 198 61 088.2-53, DE 199 80 312.9, PCT/DE 00/01869, DE 100 36 627.9-33, DE 100 28 397.7, DE 101 10 530.4, DE 101 11 014.6, PCT/EP 00/10516, EP 01 102 674.7, PACT13, PACT17, PACT18, PACT22, PACT24, PACT25, PACT26US, PACT02, PACT04, PACT08, PACT10, PACT15, PACT18(a), PACT27, PACT19. These are hereby fully incorporated into the present patent application for disclosure purposes.
VPUs are basically composed of a multidimensional, homogeneous or inhomogeneous, flat or hierarchical configuration (PA) of cells (PAEs) which are able to execute any functions, in particular logic functions and/or arithmetic functions and/or memory functions and/or network functions. PAEs are typically assigned a loading unit (CT) which determines the function of the PAEs by configuration and optionally reconfiguration.

This method is based on an abstract parallel machine model which in addition to the finite automaton also integrates imperative problem specifications and permits an efficient algorithmic derivation of an implementation on different technologies.

The following compiler classes are known according to the related art:

Classical compilers, which often generate stack machine code and are suitable for very simple processors which are essentially designed as normal sequencers (see N. Wirth, Compiler Design, Teubner Verlag).

Vectorizing compilers construct mostly linear code which is intended for special vector computers or for highly pipelined processors. These compilers were originally available for vector computers such as the Cray. Modern processors such as Pentium processors require similar methods because of the long pipeline structure. Since the individual computation steps are performed as vectorized (pipelined) steps, the code is much more efficient. However, the conditional jump causes problems for the pipeline. Therefore, a jump prediction is appropriate, which assumes a jump destination. If this assumption is incorrect, however, the entire processing pipeline must be deleted. In other words, each jump for these compilers is problematical, and actually there is no parallel processing. Jump predictions and similar mechanisms require a considerable extra complexity in terms of hardware.
There are hardly any coarse-grained parallel compilers in the actual sense, parallelism typically being marked and administered by the programmer or the operating system, e.g., it is usually performed on a thread level in MMP computer systems such as various IBM architectures, ASCI Red, etc. A thread is a largely independent program block or even a separate program. Therefore, threads are easily parallelized on a coarse-grained level. Synchronization and data consistency must be ensured by the programmer or the operating system. This is complex to program and requires a significant portion of the computation power of a parallel computer. In addition, only a fragment of the parallelism which is actually possible is in fact usable due to this coarse parallelization.

Fine-grained parallel compilers (e.g., VLIW) attempt to map the parallelism in a fine-grained form onto VLIW arithmetic means that are capable of executing a plurality of computation operations in one clock cycle but have a common register set. This limited register set is a significant problem because it must provide the data for all the computation operations. In addition, the data dependencies and inconsistent read/write operations (LOAD/STORE) make parallelization difficult.

Reconfigurable processors have a large number of independent arithmetic units, typically located in a field. These are typically interconnected by buses instead of by a common register set. Therefore, vector arithmetic units are easily constructed, while it is also possible to perform simple parallel operations. In contrast to traditional register concepts, data dependencies are resolved by the bus connections.
The object of the present invention is to provide something novel for commercial application.

The method for achieving this object is characterized in the independent patent claims. Preferred embodiments are characterized in the subclaims.

It is therefore proposed that the concepts of vectorizing compilers and parallelizing compilers (e.g., VLIW) shall be used at the same time for reconfigurable processors, and thus vectorization and parallelization shall be performed on a fine-grained level.

An important advantage is that the compiler need not map onto a fixedly predetermined hardware structure; instead, the hardware structure may be configured using the method according to the present invention so that it is optimally suitable for mapping the particular compiled algorithm.

The finite automaton is used as the basis for processing practically any method for specifying algorithms. The structure of a finite automaton is shown in Figure 1. A simple finite automaton is broken down into a combinatory network and a register stage for temporary storage of data between the individual data processing cycles.

A finite automaton executes a number of purely combinatory data manipulations (i.e., logic and/or arithmetic) and then reaches a stable state which is represented in a register (set). Based on this stable state, a decision is made as to which next state is to be reached in the next processing step and thus also which combinatory data manipulations are to be performed in the next step.

For example, a finite automaton is represented by a processor or sequencer. In a first processing step, a subtraction of data may be performed. The result is stored. In the next step, a jump may be performed, based on the result of the subtraction, resulting in a different further processing depending on the plus or minus sign of the result.
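
To make this model concrete, the following sketch (in Python; purely illustrative and not part of the original disclosure, with names such as comb and run assumed here) models the finite automaton as a purely combinatory function plus a register stage storing the stable state between cycles:

# Combinatory network: subtract, then branch on the sign of the
# result, i.e., select the next processing depending on plus/minus.
def comb(state, operand):
    result = state - operand
    return result if result >= 0 else -result

def run(initial_state, operands):
    register = initial_state           # register (set) holding the stable state
    for op in operands:                # one data processing cycle per operand
        register = comb(register, op)  # network output latched into the register
    return register

print(run(10, [3, 5, 9]))              # -> 7
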
The finite automaton permits mapping of complex algorithms onto any sequential machines, as illustrated in Figure 2. The complex finite automaton depicted here is made up of a complex combinatory network, a memory for storing data, and an address generator for addressing the data in the memory.

Any sequential program may fundamentally be interpreted as a finite automaton, but usually the result is a very large combinatory network. In programming classical "von Neumann" architectures - i.e., with all CPUs - the combinatory operations are therefore broken down into a sequence of simple, fixedly predetermined individual operations (opcodes) on CPU-internal registers. This breakdown results in states for control of the combinatory operation, which is broken down into a sequence, but these states are not present per se within the original combinatory operation or are not required. Therefore, however, the states of a von Neumann machine that are to be processed differ fundamentally from the algorithmic states of a combinatory network, i.e., from the registers of finite automatons.
It has now been recognized that the VPU technology (such as that defined essentially by some or all of the publications PACT01, PACT02, PACT03, PACT04, PACT05, PACT08, PACT10, PACT13, PACT17, PACT18, PACT22, PACT24, which are herewith fully incorporated into the present patent application through this reference), in contrast with the rigid opcodes of CPUs, makes it possible to compile complex instructions according to an algorithm to be mapped, as in flexible configurations.

In the operation of the compiler, it is particularly advantageous if the complex instructions are generated so that they are executed for the longest possible period of time in the PAE matrix without reconfiguration.

In addition, the compiler preferably generates the finite automaton from the imperative source text in such a way that it is particularly well adapted to the particular PAE matrix, i.e., operations which typically use coarse-grained logic circuits (ALUs, etc.) and, if necessary, also use fine-grained elements that are likewise present (FPGA cells in the VPU, state machines, etc.) particularly efficiently, are provided.

The compiler-generated finite automaton is then broken down into configurations.

The processing (interpretation) of the finite automaton is performed on a VPU in such a way that the generated configurations are mapped successively onto the PAE matrix and the operating data and/or states which are to be transmitted between the configurations are stored in the memory. The method known from PACT04 and/or the corresponding architecture may be used to this end. This memory is determined or provided by the compiler. A configuration here represents a plurality of instructions; at the same time, a configuration determines the operation of the PAE matrix for a plurality of cycles, and during these cycles a certain amount of data is processed in the matrix. This data comes from a VPU-external source and/or an internal memory and is written to an external source and/or an internal memory. The internal memories here replace the register set of a CPU according to the related art in such a way that, for example, a register is represented by a memory so that instead of one data word per register, a full data set is stored per memory.

It may also be essential that the data and/or states of the processing of a configuration that is running are stored in the memory in a manner determined by the compiler and are thus available for the next configuration to be run.
A significant difference from compilers which parallelize on the basis of instructions is thus that the method configures and reconfigures the PAE matrix in such a way that a configured sequence of combinatory networks is emulated on a VPU, while traditional compilers combine loaded sequences of instructions (opcodes), where one instruction may be considered a combinatory network.

The functioning of the compiler is illustrated below on the basis of a simple language as an example. This assumes a language the principles of which are already known; however, a known publication [reference "Armin Nückel dissertation"] describes only the mapping of a function onto a static combinatory network, whereas with the present invention the mapping is performed onto configurations which are then mapped onto the PAE matrix in a chronological order according to the algorithm and the states that result during processing.
The programming language assumes that there exists a "WHILE" instruction in addition to simple logical and/or arithmetic links, and this instruction is defined by the following syntax:

WHILE ...

Possible constructs are thus:
  Instruction
  Sequence of instructions
  Loop

An instruction or a sequence of instructions is mappable onto a combinatory network using the compiler method described here.

Figure 3a shows a combinatory network and the pertinent variables. The content of one and the same variable (e.g., x1) may change from one step (0301) of the network to the next step (0302).

This change is depicted in Figure 3b for the assignment

x1 := x1 + 1;

For addressing for reading the operands and for storing the results, address generators may now be synchronized with the combinatory network of the assignment. With each variable processed, corresponding new addresses are generated for operands and results (Figure 3c). In principle, any type of address generator may be used, depending on the addressing schemes of the compiled application. Shared, combined, or completely independent address generators may be implemented for operands and results.
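
As an illustration of this synchronization, a minimal sketch follows (Python; the names address_generator, operand_mem and result_mem are assumptions made here and do not appear in the original): one generator steps the operand addresses and a second one the result addresses, in lockstep with the combinatory assignment x1 := x1 + 1.

def address_generator(base, stride=1):
    # Any type of address generator may be used; this one counts linearly.
    addr = base
    while True:
        yield addr
        addr += stride

operand_mem = {i: i for i in range(8)}     # toy operand memory
result_mem = {}

op_addr = address_generator(0)             # independent generator for operands
res_addr = address_generator(0)            # independent generator for results

for _ in range(8):                         # new addresses per processed variable
    a = next(op_addr)
    r = next(res_addr)
    result_mem[r] = operand_mem[a] + 1     # the combinatory assignment

print(result_mem)                          # {0: 1, 1: 2, ..., 7: 8}
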
Typically, a plurality of data is processed within a certain configuration of PAEs in the data processing, as in the present data processing model. Therefore, the compiler is preferably designed for the simple FIFO mode which is possible in many if not most applications and is applicable at least for the data memories, which are used within this description for storing data and states of the data processing (more or less as a replacement for a traditional register set of conventional CPUs). In other words, memories are used for temporary storage of variables between configurations. Here again, a configuration is like an instruction of a normal processor, and the memories (in particular a plurality of memories) are comparable to the register set of a normal processor.

A sequence of the exemplary assignment may be generated as follows (Figure 4a):

x1 := 0;
WHILE TRUE DO
  x1 := x1 + 1;

This sequence may now be mapped via an assignment according to 2.2.1 and address generators for operands and results.
Finite sequences

For the sake of thoroughness, a particular embodiment of sequences apart from the defined constructs of the WHILE language will now be discussed. A finite sequence of the exemplary assignment may be generated as follows:

FOR i := 1 TO 10
  x1 := x1 + 1;

Such a sequence may be implemented by two methods, sketched after this list:

a) By generating an adder for computing i according to the WHILE construct (see 2.2.4) and another adder for computing x1. The sequence is mapped as a loop and is computed iteratively (Figure 5a).

b) By rolling out the loop, so that the computation of i as a function is eliminated. The computation of x1 is instantiated i times and is constructed as a pipeline, resulting in i adders connected in series (Figure 5b).
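
A small sketch may clarify the difference between the two methods; it is illustrative only (Python, with the loop bound n = 10 taken from the example above):

# a) Iterative mapping: one adder for x1, a second adder computes i (a loop).
def iterative(x1, n=10):
    i = 1
    while i <= n:          # WHILE construct computing i
        x1 = x1 + 1        # the single adder is reused each cycle
        i = i + 1
    return x1

# b) Rolled-out mapping: the computation of i is eliminated; n adders
#    connected in series form a pipeline traversed once by the data.
def rolled_out(x1, n=10):
    for _stage in range(n):   # each pass stands for one adder in the chain
        x1 = x1 + 1
    return x1

assert iterative(0) == rolled_out(0) == 10
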
2.2.4 Conditions

Conditions may be expressed via WHILE. For example:

x1 := 0;
WHILE x1 < 10 DO
  x1 := x1 + 1;

The mapping generates an additional PAE for processing the comparison. The result of the comparison is represented by a status signal (see trigger in PACT08), which is analyzed by the instruction-processing PAEs and the address generators. Figure 4b shows the resulting mapping.
By analyzing the condition (WHILE here; in general and reasonably, other instructions such as IF and CASE are also implementable), a status is generated which may be made available to the following data processing (DE 197 04 728.9) and/or may be sent to the CT or a local load control (DE 196 54 846.2), which then derives information therefrom regarding the further program flow and any pending reconfigurations.
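
The following sketch (Python; comparison_pae and trigger are names assumed here) shows the principle: the comparison is mapped onto its own unit whose status signal gates the data path, and the same status could instead be relayed to the CT or a local load control:

def comparison_pae(x1):
    return x1 < 10                 # status signal (cf. trigger in PACT08)

x1 = 0
while True:
    trigger = comparison_pae(x1)   # status generated by the condition
    if not trigger:                # alternatively, the status could be sent
        break                      # to the CT / load control to request a
    x1 = x1 + 1                    # reconfiguration instead of a local exit

print(x1)                          # -> 10
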
According to the previous method, each program is mapped in a system having the following structure:

1. Memory for operands (comparable to the register of a CPU)
2. Memory for results (comparable to the register of a CPU)
3. Network of a) assignments and/or b) compare instructions, i.e., conditions such as IF, CASE, loops (WHILE, FOR, REPEAT)
4. Optional address generator(s) for triggering the memories according to 1 and 2
2.2.6 Handling of states

For the purposes of the compiler described here, a distinction is now made between states that are relevant to the algorithm and those that are not. States are usually depicted in the VPU technology by status signals (e.g., trigger in PACT08) and/or handshakes (e.g., RDY/ACK in PACT02). In general, states (in particular in other technologies, e.g., FPGAs, DPGAs, Chameleon modules, Morphics, etc.) may be represented by any signals, signal bundles and/or registers. The compiler method disclosed here may also be applied to such technologies, although essential parts of the description are preferentially focused on the VPU.

Relevant states are necessary within the algorithm to describe its correct function. They are essential for the algorithm. Irrelevant states occur due to the hardware used and/or the selected mapping or for other secondary reasons. They are thus essential for the mapping (i.e., the hardware).

Only the relevant states must be obtained with the data. Therefore, these states are stored together with the data in the memories because they occur either as a result of the processing with the data or they are necessary as operands with the data for the next processing cycle.
Irrelevant states, however, are necessary only locally and/or chronologically locally and therefore need not be stored.

Example:

a) The state information of a comparison is relevant for further processing of data because it determines the functions to be executed.

b) It is assumed that a sequential divider is formed, e.g., by mapping a division instruction onto hardware which supports only sequential division. This results in a state which characterizes the computation step within the division. This state is irrelevant because only the result (i.e., the division that is performed) is necessary for the algorithm. Thus, only the result and the time information (i.e., the availability) are required in this case. The compiler preferably differentiates between such relevant and irrelevant states.

The time information is accessible, for example, via the RDY/ACK handshake in VPU technology according to PACT01, 02, 13. However, it should be pointed out here that the handshake also does not represent a relevant state because it only signals the validity of the data, so this in turn reduces the remaining relevant information to the existence of valid data.
2.2.7 Handling of time

In many programming languages, in particular in sequential languages such as C, an exact chronological sequence is determined implicitly by the language. In sequential programming languages, this is accomplished, for example, by the sequence of individual instructions. If this is required by the programming language and/or the algorithm, the compiler method is executed in such a way that the time information may be mapped onto synchronization models such as RDY/ACK and/or REQ/ACK or a time stamp method according to DE 101 10 530.4.

In other words, the implicit time information of sequential languages is mapped in a handshake protocol in such a way that the handshake protocol (RDY/ACK protocol) transmits the time information and in particular ensures the sequence of instructions.
For example, the following FOR loop is run and iterated only when the variable inputstream is acknowledged using a RDY for each run. If there is no RDY, the loop run is stopped until RDY arrives.

while TRUE
  s := 0
  for i := 1 to ...
    s := s + inputstream;
The sequential language property of being controlled only by instruction processing is thus linked in compiling to the data flow principle of controlling the processing through the data stream, i.e., the existence of data. In other words, a command and/or an instruction (e.g., s := s + inputstream;) is processed only when the operation may be executed and the data is available.
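
As an analogy only - not the actual VPU protocol - the following Python sketch approximates the RDY side of the handshake with a blocking queue, so that the instruction s := s + inputstream executes only when a data word exists:

import queue
import threading

inputstream = queue.Queue()

def producer():
    for value in (1, 2, 3):
        inputstream.put(value)     # RDY: a valid data word appears

threading.Thread(target=producer).start()

s = 0
for _ in range(3):                 # the loop iterates once per acknowledged word
    s = s + inputstream.get()      # blocks while there is no RDY (stalled run)
print(s)                           # -> 6
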
It is noteworthy that this method does not usually result in any change in the syntax or semantics of a high-level language. Thus, existing high-level language code may be brought to execution on a VPU without problems by recompiling.
2.2.8 Loading and storing data

For the principles of the load/store operations, the following is noteworthy.

The following types of addressing are treated differently:

1. External addressing, i.e., data transfers with external modules.
2. Internal addressing, i.e., data transfers between PAEs, in particular between RAM PAEs and ALU PAEs.

Furthermore, special attention should be devoted to the chronological decoupling of data processing and data loading and storing.

Bus transfers are broken down into internal and external transfers.

bt1) External read accesses (load operations) are separated, and in one possible embodiment they are also preferentially translated into a separate configuration. The data is transferred from an external memory to an internal memory.

bt2) Internal accesses are linked to the data processing, i.e., the internal memories (register operations) are read and/or written to for the data processing.

bt3) External write accesses (store operations) are separated, i.e., in a preferred possible embodiment they are also translated into a separate configuration. The data is transferred from an internal memory to an external memory.

It is essential that bt1, bt2 and bt3 - i.e., loading the data (load), processing the data (data processing and bt2) and writing the data (bt3) - may be translated into different configurations which may, if necessary, be executed at a different point in time.
This method will now be illustrated on the basis of the following:

function example (a, b : integer) -> x : integer
  for i := 1 to 100
    for j := 1 to 100
      x[i] := a[i] * b[j]

This function may be transformed by the compiler into three parts, i.e., configurations (subconfig):

example#dload: loads the data from external (memory, peripheral, etc.) and writes it into internal memories. Internal memories are identified by r# and the name of the original variable.

example#process: corresponds to the actual data processing. It reads the data out of internal operands and writes the results back into internal memories.

example#dstore: writes the results from the internal memories to external (memories, peripheral, etc.).

function example# (a, b : integer) -> x : integer

subconfig example#dload
  for i := 1 to 100
    r#a[i] := a[i]
  for j := 1 to 100
    r#b[j] := b[j]

subconfig example#process
  for i := 1 to 100
    for j := 1 to 100
      r#x[i] := r#a[i] * r#b[j]

subconfig example#dstore
  for i := 1 to 100
    x[i] := r#x[i]

One essential effect of this method is that instead of i*j = 100 * 100 = 10,000 external accesses, only i+j = 100 + 100 = 200 external accesses are executed to read the operands. These accesses are also completely linear, which greatly accelerates the transfer speed with today's bus systems (burst) and/or memories (SDRAM, DDRAM, RAMBUS, etc.).

The internal memory accesses are performed in parallel because different memories have been assigned to the operands.

For writing the results, i = 100 external accesses are necessary and these may also be performed again linearly with maximum performance.
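
The access counts may be checked with a small simulation (Python; a plain list stands in for each internal memory, which is an assumption of this sketch, not a statement about the hardware):

n, m = 100, 100
a = list(range(n)); b = list(range(m))
external_accesses = 0

r_a, r_b = [], []                  # internal memories r#a, r#b
for i in range(n):                 # example#dload: linear external reads
    r_a.append(a[i]); external_accesses += 1
for j in range(m):
    r_b.append(b[j]); external_accesses += 1

r_x = [0] * n                      # example#process: internal accesses only
for i in range(n):
    for j in range(m):
        r_x[i] = r_a[i] * r_b[j]

x = list(r_x)                      # example#dstore: n = 100 linear external writes

print(external_accesses)           # -> 200 instead of 10,000
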
If the number of data transfers is not known in advance (e.g., WHILE loop) or is very large, a method that reloads the operands as needed through subprogram calls and/or which writes the results externally may be used. In a preferred embodiment, the states of the FIFOs may (also) be queried: 'empty' when the FIFO is empty and 'full' when the FIFO is full. The program flow responds according to the states. It should be pointed out that certain variables (e.g., ai, bi, xi) are defined globally. For performance optimization, a scheduler may execute the configurations example#dloada, example#dloadb even before the call of example#process according to the methods already described, so that data is already preloaded. Likewise, example#dstore(n) may also be called up after termination of example#process in order to empty r#x.
subconfig example#dloada(n)
  while !full(r#a) AND ai<=n
    r#a[ai] := a[ai]
    ai++

subconfig example#dloadb(n)
  while !full(r#b) AND bi<=n
    r#b[bi] := b[bi]
    bi++

subconfig example#dstore(n)
  while !empty(r#x) AND xi<=n
    x[xi] := r#x[xi]
    xi++

subconfig example#process
  for i := 1 to n
    for j := 1 to m
      if empty(r#a) then example#dloada(n)
      if empty(r#b) then example#dloadb(m)
      if full(r#x) then example#dstore(n)
      r#x[i] := r#a[i] * r#b[j]
    bi := 1
The subprogram calls and the administration of the global variables are comparatively complex for reconfigurable architectures. Therefore, in a preferred embodiment, the following optimization may be performed, in which all the configurations run largely independently and terminate after complete processing. Since the data b[j] is needed repeatedly, example#dloadb must accordingly be run several times. Two alternatives are presented for this as an example:

Alternative 1: example#dloadb terminates after each run and is reconfigured by example#process for each restart.

Alternative 2: example#dloadb runs infinitely and is terminated by example#process.

A configuration is inactive (waiting) during 'idle'.
subconfig example#dloada(n)
  for i := 1 to n
    while full(r#a)
      idle
    r#a[i] := a[i]
  terminate

subconfig example#dloadb(n)
  while 1  // ALTERNATIVE 2
    for i := 1 to n
      while full(r#b)
        idle
      r#b[i] := b[i]
  terminate

subconfig example#dstore(n)
  for i := 1 to n
    while empty(r#x)
      idle
    x[i] := r#x[i]
  terminate
subconfig example#process
  for i := 1 to n
    for j := 1 to m
      while empty(r#a) or empty(r#b) or full(r#x)
        idle
      r#x[i] := r#a[i] * r#b[j]
    config example#dloadb(n)  // ALTERNATIVE 1
  terminate
To prevent waiting cycles, configurations may also be terminated as soon as they are temporarily unable to continue fulfilling their function. The corresponding configuration is removed from the reconfigurable module but remains in the scheduler. The instruction 'reenter' is used for this purpose below. The relevant variables are saved before termination and are restored in the repeated configuration:

subconfig example#dloada(n)
  for ai := 1 to n
    if full(r#a) reenter
    r#a[ai] := a[ai]
  terminate

subconfig example#dloadb(n)
  while 1  // ALTERNATIVE 2
    for bi := 1 to n
      if full(r#b) reenter
      r#b[bi] := b[bi]
  terminate
subconfig example#dstore(n)
  for xi := 1 to n
    if empty(r#x) reenter
    x[xi] := r#x[xi]
  terminate

subconfig example#process
  for i := 1 to n
    for j := 1 to m
      if empty(r#a) or empty(r#b) or full(r#x) reenter
      r#x[i] := r#a[i] * r#b[j]
    config example#dloadb(n)        // ALTERNATIVE 1
    terminate example#dloadb        // ALTERNATIVE 2
  terminate
2.3 Macros

More complex functions of a high-level language, such as loops, are typically implemented by macros. The macros are therefore preselected by the compiler and are instantiated at translation time (see Figure 4).

The macros are either constructed from simple language constructs of the high-level language or they are constructed at the assembler level. Macros may be parameterized to permit simple adaptation to the algorithm described (see Figure 5, 0502). In the present case, the macros are also to be incorporated.
2.4 Feedback loops and registers

Undelayed feedback loops may occur within the mapping of an algorithm into a combinatory network and may oscillate in an uncontrolled manner.

In the VPU technologies implemented in practical terms according to PACT02, this is prevented by a construct of the PAE in which at least one register is defined by decoupling, more or less fixedly in the PAEs.

In general, undelayed feedback loops are discernible through graphic analysis of the resulting combinatory network. Registers are then deliberately inserted for separation into the data pathways in which there is an undelayed feedback loop. The compiler may thus manage register means, i.e., memory means.

Due to the use of handshake protocols (e.g., RDY/ACK according to 2.2.7), the correct functioning of the computation is also ensured by the insertion of registers.
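
A possible reading of this "graphic analysis" is ordinary cycle detection on the network graph; the following sketch (Python; the graph representation and all names are assumed here) marks each back edge, i.e., each data pathway into which a register would be inserted:

def edges_needing_registers(graph):
    registered, visited, on_path = [], set(), set()

    def dfs(node):
        visited.add(node); on_path.add(node)
        for succ in graph.get(node, ()):
            if succ in on_path:                 # back edge: undelayed feedback
                registered.append((node, succ)) # decouple it with a register
            elif succ not in visited:
                dfs(succ)
        on_path.discard(node)

    for node in graph:
        if node not in visited:
            dfs(node)
    return registered

# adder output fed back to its own input via node 'f'
network = {'add': ['f', 'out'], 'f': ['add'], 'out': []}
print(edges_needing_registers(network))         # -> [('f', 'add')]
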
2.5 Time-division multiplexing (TDM) of the code

Basically, each PAE matrix implemented has practically only a finite size. In the following step, a partitioning of the algorithm according to section 2.2.5, paragraph 4 a/b, into a plurality of configurations, which are configured one after the other into the PAE matrix, must therefore be performed. The goal is typically to compute the largest possible number of data packets in the network without having to perform a reconfiguration.

Between configurations, a temporary storage is performed, with the temporary memory - similar to a register in the case of CPUs - storing the data between the individual configurations that are executed sequentially.

A sequential processor model is thus created by reconfiguring configurations, by data processing in the PAE matrix and by temporary storage in the memories.

CA 02458199 2004-02-13
Ia other words. duc to the compiling described here, an vpcode
is not executed sequentially in the VPU technology lout instead
complex oonfigurations are executed. While in the c~.se of
CPUS, an opcode typically processes a data word, in vPU
technology a plurality of data words (a data packet) is
processed by a configuration. The efficiency of the
reconfigurable architecture is therefore increased ue to a
better ratio between reconfiguration, complexity and data
processing.
At the same time, instead of a register, a memory i used in
VPU technology because instead of proveeeing data w rds, data
packets are processed between configurations_
z5 This memory may be designed as a random access memo y, a
stack, a FIFO or any other memory architevtuxe, alt ough a
FTFO is typically the best and the easiest option to be
implemented.
Data is then pYOCessed by the PAE matrix according'to the
configured algorithm and .~s stored in one or more emories.
The PAE matxix ie reconfigured after processing a olume of
data anal the new configuration wiChdraws the irate ediata
results from the memory (memories) and continues t a execution
of the.progrdm. New data may then of course be inC rporated
into the computation from external memories and/or the
periphery and likewise reSUlLS may be written to a terx~.a1
memories and/or the periphery.
3o In other words, the typical' seqiie~'ice in 'data prvce sing is
read out from internal RAMS, processing the data i the matrix
and writing data to the internal memories, and any external
Sources yr targets for the data transfer may be us d
additionally or instead of the internal memorise for the data
NY01 667128 v 1 2 3

CA 02458199 2004-02-13
processing.
Although "sequencing" in CPUs is defined as the new loading of an opcode, "sequencing" of VPUs is thus defined as (re)configuring configurations according to what was said above. However, this does not mean that under certain conditions, parts of the field may not be operated as a sequencer in the conventional sense.

The information as to when and/or how sequencing is performed, i.e., which next configuration is to be configured, may be represented by various information which may be used individually or in combination. For example, the following strategies, sketched after this list, are appropriate for deriving the information either alone and/or in combination, i.e., alternatively:

a) defined by the compiler at translation time;
b) defined by the event network (trigger, DE 197 04 728.9), whereby the event network may represent internal and/or external states;
c) by the degree of filling of the memories (trigger, DE 197 04 728.9, DE 196 54 846.2-53).
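
Purely as an illustration of how these sources might be combined, the following sketch (Python; all names and the configuration set are hypothetical) derives the next configuration from a compile-time order, the event network and a memory fill level:

compile_time_order = ['dload', 'process', 'dstore']     # strategy a)

def next_configuration(current, triggers, fill, capacity):
    if 'abort' in triggers:        # strategy b): event network / trigger
        return 'task_switch'
    if fill >= capacity:           # strategy c): degree of filling of a memory
        return 'dstore'
    nxt = compile_time_order.index(current) + 1
    return compile_time_order[nxt % len(compile_time_order)]

print(next_configuration('dload', triggers=set(), fill=3, capacity=8))
# -> 'process' (the successor defined at translation time)
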
2.5.1 Influence of the TDM on the processor model

Partitioning of the algorithm determines to a significant extent the relevant states which are stored in the memories between the various configurations. If a state is relevant only within a configuration (locally relevant state), it is not necessary to store it, which is taken into account preferentially by the compiler method.

For the purposes of debugging the program to be executed, however, it may nevertheless be expedient to store these states to permit access by the debugger. Reference is made to DE 101 42 904.5, which is herewith incorporated to the full extent for disclosure purposes.
In addition, additional states may become relevant when using a task switch mechanism (e.g., by an operating system or interrupt sources) and the configurations being executed currently are interrupted, other configurations are loaded and/or the aborted reconfiguration is to be continued at a later point in time. A detailed description follows in the section below.

A simple example should show a differentiating factor for locally relevant states:

a) A branching of the type "if () then ... else ..." fits completely into a single configuration, i.e., both data paths (branches) are mapped jointly and completely within the configuration. The state that is obtained in the comparison is relevant, but only locally, because it is no longer needed in the following configurations.

b) The same branching is too large to fit completely into a single configuration. Multiple configurations are necessary to map the complete data paths. In this case, the state is globally relevant and must be stored and assigned to the particular data because the following configurations require the particular state of the comparison in further processing of the data.
2.6 Task switching

The possible use of an operating system has an additional influence on the consideration and handling of states. Operating systems use task schedulers, for example, to manage multiple tasks to make multitasking available.

Task schedulers interrupt tasks at a certain point in time, start other tasks, and after their processing, return to further processing of the interrupted task.
If it is ensured that a configuration - which may correspond here to processing a task - is terminated only after being completely processed, i.e., when all the data and states to be processed within this configuration cycle have been stored, then locally relevant states may remain unstored. This method, i.e., complete processing of a configuration and the subsequent task switch, is the preferred method for operating reconfigurable processors and corresponds essentially to the sequence in a normal processor which first completes the assignments currently being processed and then changes tasks.

For some applications, however, a particularly short response to a task change request is necessary, e.g., in realtime applications. It may be expedient here to abort configurations before they are completely processed.

If the task scheduler aborts configurations before they are completely processed, local states and/or data must be stored. In addition, this is advantageous when the processing time of a configuration is not predictable. In conjunction with the known halting problem and the risk that a configuration may not terminate at all, e.g., due to an error, this also seems appropriate so that a deadlock of the entire system may be prevented. Therefore, in the present case, relevant states that are necessary for a task change and a renewed correct resumption of the data processing are also to be regarded as such, taking into account task switching.

Thus in task switching, the memory for results is to be secured and, if necessary, the memory for operands is also to be secured and restored at a later point in time, i.e., when returning to this task. This may be performed in a manner comparable to that used with push/pop instructions and methods according to the related art. In addition, the state of the data processing must be secured, i.e., the pointer to the last operand processed completely. Reference is made here to PACT18 in particular.
Depending on the optimization of task switching, there are two possibilities here, for example:

a) The aborted configuration is reconfigured and only the operands are loaded. Data processing thus begins anew as if processing of the configuration had not yet begun. In other words, all data computations are simply executed from the beginning, with computations already being performed in advance, if necessary. This possibility is simple but not efficient.

b) The aborted configuration is reconfigured, with the operands and the results already computed being loaded into the particular memory. Data processing continues with the operands that have not yet been computed completely. This method is much more efficient, but it presupposes that additional states which occur during processing of the configuration may become relevant, if necessary, e.g., if at least one pointer, pointing to the last operands completely computed, must be secured, so that after a successful new configuration, the pointer may be reset at their successors.
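
Option b) may be sketched as follows (Python; the checkpoint structure is an assumption of this sketch - PACT18 is the reference for the actual mechanism): a pointer to the last completely processed operand is secured on abort and reloaded on reconfiguration.

def process(operands, results, state, abort_at=None):
    i = state['pointer']                  # resume behind the checkpoint
    while i < len(operands):
        if abort_at is not None and i == abort_at:
            state['pointer'] = i          # secure the relevant state (pointer)
            return False                  # aborted, not terminated
        results.append(operands[i] * 2)   # stand-in for the data processing
        i += 1
    state['pointer'] = i
    return True                           # terminated completely

state, results = {'pointer': 0}, []
operands = list(range(5))
process(operands, results, state, abort_at=2)   # task switch after two operands
process(operands, results, state)               # reconfigured: resumes at index 2
print(results)                                  # -> [0, 2, 4, 6, 8]
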
2.7 Context switching

A particularly preferred variant for managing relevant data is made available through the context switch described below. In task switching and/or in execution of configurations and in changing them (see, for example, DE 102 06 653.1, which is herewith incorporated to the full extent for disclosure purposes), it may be necessary to save data or states which are typically not stored together with the operating data in the memory because they only mark an end value for a subsequent configuration.

The context switch which is implemented preferentially according to the present invention is performed in such a way that a first configuration is removed and the data to be saved remains in the corresponding memories (REG) (memories, registers, counters, etc.).

A second configuration which connects the REGs to one or more global memories in a suitable manner and in a defined sequence may then be loaded.

The configuration may use address generators, for example, to access the global memory (memories). Thus it is not necessary to first have each individual memory site defined by the compiler and/or to access REGs embodied as memories.

According to the configured connection between the REGs, the contents of the REGs are written in a defined sequence into the global memory, the particular addresses being preselected by address generators. The address generator generates the addresses for the global memory (memories) in such a way that the memory regions written (PUSH AREA) may be unambiguously assigned to the first configuration that has been removed.

Different address spaces are thus provided preferentially for different configurations. The configuration here corresponds to a PUSH of ordinary processors.
Other configurations then use the resources.

Now the first configuration is to be started again. First, a third configuration which connects the REGs of the first configuration in a defined sequence is started.

The configuration may in turn use address generators, for example, to access the global memory or memories and/or to access REGs embodied as memories.

An address generator preferably generates addresses in such a way that a correct access to the PUSH AREA assigned to the first configuration takes place. The generated addresses and the configured sequence of the REGs are such that the data is written from the memories into the REGs in the original order. The configuration corresponds to a POP of ordinary processors.

The first configuration is then started again.

In summary, a context switch is preferably performed in such a way that by loading particular configurations which operate like PUSH/POP of known processor architectures, the data to be saved is exchanged with a global memory. This data exchange via global memories using PUSH/POP exchange configurations is regarded as being particularly relevant.

This function is to be illustrated in an example: A function adds two rows of numbers; the length of the rows is not known at translation time and is only known at run time.

proc example
  while i<length do
    x[i] = a[i] + b[i]
    i = i + 1
The function is then interrupted during its execution, e.g., by a task switch or because the memory provided for x is full; a, b, x are at this point in time in memories according to the present invention; however, i and length, if necessary, must be saved.

To do so, the example configuration is terminated, with the register contents being saved and a push configuration being started, reading i and length out of the registers and writing them to a memory.

proc push
  mem[<push_adr_example>] = i
  push_adr_example++
  mem[<push_adr_example>] = length

After execution, push is terminated and the register contents may be deleted.

Other configurations are executed. After a period of time, the example configuration is started again.

First a pop configuration is started, reading the register contents out of the memory again.

proc pop
  i = mem[<push_adr_example>]
  push_adr_example++
  length = mem[<push_adr_example>]

After execution, pop is terminated and the register contents are kept. The example configuration is started again.
2.8 Algorithmic optimization

Control structures are separated from algorithmic structures by the translation method described here. For example, a loop is broken down into a body (WHILE) and an algorithmic structure (instructions).

The algorithmic structures are preferably optionally optimizable by an additional tool downstream from the separation.

For example, downstream algebra software may optimize and minimize the programmed algorithms. Such tools are known by such names as AXIOM, MARBLE, etc., for example. By minimization it is possible to achieve a more rapid execution of the algorithm and/or a substantially reduced space requirement.

The result of the optimization is then sent back to the compiler and processed further accordingly.

It should also be pointed out that modern compilers (front ends) have already implemented a number of optimizations for algorithms (including algebraic optimizations in some cases) which are of course still usable as part of the method described here.

It should be pointed out explicitly that the method described here, in particular section 2.2.7 "Handling of time" and section 2.3 "Macros," may also be used on compilers according to PACT20. PACT20 is herewith included to the full extent in the present patent application for disclosure purposes.
3. Applicability for processors according to the related art, in particular with a VLIW architecture

It should be pointed out in particular that instead of a PAE matrix, a configuration of arithmetic logic units (ALUs) according to the related art may be used, such as those conventionally used in VLIW processors, and/or a configuration of complete processors such as those conventionally used in multiprocessor systems, for example. The use of a single ALU constitutes a special case here, so that this method is also applicable for normal CPUs.

A method which permits translation of the WHILE language into semantically correct finite automatons was developed in the dissertation [reference dissertation Armin Nückel]. In addition, a finite automaton may be used as a "subprogram" and vice versa. This yields the possibility of mapping a configuration onto different implementation technologies such as CPUs, symmetrical multiprocessors, FPGAs, ASICs, VPUs.

In particular, it is possible to assign the optimally suitable hardware to parts of an application and/or to determine a particular suitability and assign optimum hardware on the basis of the more or less good suitability. Preferably, temporary distribution and reservation of resources may also be detectable. In other words, for example, a data flow structure would be assigned to a data flow architecture, while a sequential structure is mapped onto a sequencer if these are available and/or present.

The corresponding problems of the resource assignments for the individual algorithms may be solved, e.g., by a "job assignment" algorithm for managing the assignment.

CA 02458199 2004-02-13
4. Implementation
The implementation of a compiler aCCOrc3irig to the present
invention should start rocll a "normal" sequential programming
language such as C or P saal. These languagec hav~ the
property that due to th ir;,sequentia7. character, a
chronological sequence s implicitly and a~'tifieial7.y
generated Sue Co the la gu~ge definition per se.
Example A:
Line 1: i++ .
Line 2: a = 1 % b
Line 3: x = p - a .
Line ~: j - i ~ i
The language definition f~.~edly predetermines that line ~. is
executed before line 2 e~Qre line 3 before line ~. Howevex,
line 4 could also be ~x cued directly after line 1 and thus
in parallel with lines arid 3.
In other words, sequential languages incorporate additional artificial states that are not algorithmically determined. Only the correct chronological sequence of the computations in example A is important: line 4 must not be computed until i is correctly defined, i.e., after the processing of line 1. Line 2 also must not be processed until after the correct definition of i (i.e., after processing line 1). Line 3 requires the result of line 2 (variable a) and therefore may not be computed until after the correct definition of variable a. This results in a data dependence but no particular state.
On the basis of the data dependencies of variable a in lines 2 and 3 (line 3 uses a as an operand; a is the result of line 2), the following transformation may thus be performed automatically by the compiler for representation of the parallelizability and/or vectorizability (ParVec transformation):
Line 2: VEC{a = i * b;
Line 3: x = p - a}
VEC means that each term separated by ';' is processed in succession; the terms within the braces may basically be pipelined. All computations must preferably be performed and concluded at the end of VEC{} so that data processing is continued downstream from the VEC.
In an internal representation of the data structures in the compiler, the two computations are preferably marked as a vector:
VEC{a=i*b; x=p-a}
Line 4 yields a simple vector:
VEC{j=i*i}
Since line 4 may be computed at the same time as lines 2 and 3, the parallelism may be expressed as follows:
PAR{{VEC{a=i*b; x=p-a}}{VEC{j=i*i}}}
PAR means that each term separated by '{..}' may be processed at the same time. All computations must preferably be performed and concluded at the end of PAR{} so that data processing is continued downstream from PAR.
If line 1 is included, this yields:
VEC{i++; PAR{{VEC{a=i*b; x=p-a}}{VEC{j=i*i}}}}
Since VEC{j=i*i} represents a vector with only one element, it may also be written as follows:
VEC{i++; PAR{{VEC{a=i*b; x=p-a}}{j=i*i}}}
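The dependence test underlying this ParVec transformation may be stated compactly: two statements may be placed in a common PAR{} exactly when neither writes a variable that the other reads or writes. The following C sketch is purely illustrative and not part of the patent text; the statement descriptor and all function names are invented for this purpose.

#include <stdbool.h>
#include <string.h>

#define MAX_READS 4

/* Hypothetical statement descriptor: the variables a statement
   reads and the single variable it writes. */
typedef struct {
    const char *reads[MAX_READS];
    int         nreads;
    const char *writes;
} Stmt;

static bool reads_var(const Stmt *s, const char *v) {
    for (int i = 0; i < s->nreads; i++)
        if (strcmp(s->reads[i], v) == 0)
            return true;
    return false;
}

/* True when s1 and s2 have no output, flow or anti dependence,
   i.e., when they may be placed in a common PAR{}. */
static bool independent(const Stmt *s1, const Stmt *s2) {
    if (strcmp(s1->writes, s2->writes) == 0)
        return false;                 /* output dependence */
    if (reads_var(s2, s1->writes))
        return false;                 /* flow dependence   */
    if (reads_var(s1, s2->writes))
        return false;                 /* anti dependence   */
    return true;
}

int main(void) {
    Stmt line2 = { { "i", "b" }, 2, "a" };       /* a = i * b */
    Stmt line4 = { { "i" },      1, "j" };       /* j = i * i */
    return independent(&line2, &line4) ? 0 : 1;  /* exits 0: PAR{} is legal */
}

For example A, this test finds line 4 independent of lines 2 and 3 (the three statements only share the read of i), whereas every line depends on the write of i in line 1 — exactly the structure expressed by VEC{i++; PAR{...}} above.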
Another example shows a relevant state.
Example B:
Line 1: i++
Line 2: a = i * b
Line 3: if a < 100 {
Line 4: x = p - a
Line 5: } else {
Line 6: j = i * i }
Line 6 may now be executed only after the computation of lines 2 and 3. The computation of lines 4 and 6 takes place alternatively. Thus, the state of line 3 is relevant for further data processing (relevant state).
Conditional states may be expressed by IF in the case of a transformation:
Line 1-2: VEC{i++; a=i*b}
Line 3: IF{{a<100}{line 4}{line 6}}
Line 4: VEC{x=p-a}
Line 6: VEC{j=i*i}
In summary, this yields:
VEC{i++; a=i*b; IF{{a<100}{VEC{x=p-a}}{VEC{j=i*i}}}}
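Internally, a compiler may carry these constructs as explicit node kinds of its intermediate representation. The following C sketch of such a tree is a hedged illustration only; the patent does not prescribe this data structure, and all names are invented.

/* Hypothetical tree IR in which VEC, PAR and IF are node kinds.
   A VEC node sequences its children (pipelineable), a PAR node
   executes its children concurrently, and an IF node carries a
   condition plus one subtree per alternative. */
typedef enum { N_STMT, N_VEC, N_PAR, N_IF } NodeKind;

typedef struct Node {
    NodeKind      kind;
    const char   *text;      /* source text of N_STMT, condition of N_IF */
    struct Node **children;  /* terms of VEC/PAR; {then, else} of IF     */
    int           nchildren;
} Node;

/* Example B then corresponds to the tree
   N_VEC("i++", "a=i*b",
         N_IF("a<100", N_VEC("x=p-a"), N_VEC("j=i*i"))). */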
Other relevant states are generated by looping:
Example C:
Line 1: for (i = 1; i < 100; i++)
Line 2: a = a * i
Line 3: q = p / a
Line 3 may be executed only after the loop has been terminated. Thus there are relevant states at conditional jumps.
A first transformation of the loop yields:
Line 1: i = 1;
Line 2: loop: if i >= 100 then exit
Line 3: a = a * i
Line 4: i++
Line 5: jump loop
Line 6: exit: q = p / a
Lines 3 and 4 may be computed in parallel because line 4 does not depend on the result of line 3:
PAR{{a=a*i}{i++}}
Line 5 yields a vector with the generated PAR because only after complete computation of the values is it possible to jump back into the loop (so there is a time dependence here):
VEC{PAR{{a=a*i}{i++}}; jump loop}
This yields the following for the condition:
loop: IF{{i>=100}{jump exit}{VEC{PAR{{a=a*i}{i++}}; jump loop}}}
Line 1 is a vector with the condition since it must be executed before the condition (the IF uses i as an operand; i is the result of line 1). Line 6 is in turn a vector with the condition since a is used as the operand and a is the result of the condition.
This yields the following (in simplified notation):
VEC{
  i++;
  loop: IF{
    {i>=100}
    {jump exit}
    {VEC{
      PAR{
        {a=a*i}
        {i++}
      };
      jump loop
    }}
  };
  exit: q=p/a
}
The contents of VEC{} and PAR{} may be considered as purely combinatory networks. VEC and PAR are preferably designed as a Petri network so as to control further processing after complete processing of the particular content.
Due to the possibility of considering VEC and PAR as a purely combinatory network, it becomes necessary to ensure the state of the loop. In other words, in this case it is in fact necessary to create a finite automaton.

The instruction REG{} stores variables in a register to this end. Thus, a finite automaton which is constructed exactly according to the algorithm is obtained due to the use of the combinatory networks VEC and PAR in conjunction with the register REG:
VEC{
  i++;
  loop: IF{
    {i>=100}
    {jump exit}
    {VEC{
      PAR{
        {a=a*i}
        {i++}
      };
      REG{a, i};
      jump loop
    }}
  };
  exit: q=p/a
}
It should be pointed out in particular that in the applicant's VPU technology (see PACT02) input and/or output registers are provided on the PAEs, and the correctness over time and the availability of data are ensured by an integrated handshake protocol (RDY/ACK). To this extent, the requirement is preferably met on leaving VEC{} or PAR{}, whose internal data processing is concluded automatically, for all variables to be used subsequently (if data processing were not concluded, the following computation steps would wait for the end and the arrival of the data). Oscillatory feedback is also ruled out through the integrated registers.
To this extent, the following term is correct for this technology:
VEC{PAR{{a=a*i}{i++}}; jump loop}
For other technologies which do not have the aforementioned embodiments, or have them only partially, the term should be formulated as follows:
VEC{PAR{{a=a*i}{i++}}; REG{a, i}; jump loop}
It should be pointed out that this form results in a correct and optimum mapping of the algorithm onto the reconfigurable processor even in the applicant's VPU technology.
REG may be used within the combinatory networks VEC and PAR. Strictly considered, VEC and PAR thus lose the property of combinatory networks. Abstractly, however, REG may be regarded as a complex element (REG element) of a combinatory network on which its own processing time is based. Processing of the following elements is made dependent on the termination of the computation of the REG element.

With awareness of this conceptual inaccuracy, the use of REG within VEC and PAR is allowed in the remaining course and is in particular also necessary.
As already mentioned above, the use of REG is typically not necessary within a configuration of a VPU of the applicant, but is explicitly always necessary when the computation results of a configuration are stored, so that REG in fact corresponds to the explicit register of a finite automaton in this application case.
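The effect of REG{a, i} may be made concrete by simulating the loop of example C as a clocked finite automaton: the combinatory network computes next-state values from the current register contents, and only the common register update makes them visible. A minimal sketch, assuming a single-cycle PAR{{a=a*i}{i++}} and a loop bound shortened to 10 so that the product stays within int range:

#include <stdio.h>

int main(void) {
    /* The registers of the finite automaton (cf. REG{a, i}). */
    int a = 1, i = 1;
    int p = 1000000;

    for (;;) {
        if (i >= 10) break;        /* loop: IF{{i>=10}{jump exit}...} */
        /* Combinatory network PAR{{a=a*i}{i++}}, evaluated on the
           current register values only ... */
        int a_next = a * i;
        int i_next = i + 1;
        /* ... and REG{a, i}: both registers update together, which
           rules out undelayed (oscillatory) feedback. */
        a = a_next;
        i = i_next;
    }
    printf("q = %d\n", p / a);     /* exit: q = p/a */
    return 0;
}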
In addition to the synthesis of finite automatons for loops, finite automatons are necessary in particular in another case: If an algorithm is too large to be processed completely within the PAEs of a reconfigurable processor, it must be broken down into several partial algorithms. Each partial algorithm represents a configuration for the reconfigurable processor. The partial algorithms are configured in succession, i.e., sequentially, on the processor, the results of the preceding configuration(s) being used as operands for the particular new configuration.
In other words, the reconfiguration results in a finite automaton which processes and stores data at a point in time t; at a point in time t + 1, possibly after a reconfiguration, the stored data is processed differently, if necessary, and stored again. It is essential that t is not defined in the traditional sense by clock pulses or instructions but instead by configurations. Reference is made here in particular to the processor model presentation (PACT, October 2000, San Jose).
In other words, a configuration includes a combinatory network of VEC and/or PAR, the results of which are stored (REG) for further use in the next configuration:
Configuration 1: VEC{Operands; {VEC|PAR}; REG{Results1}}
Configuration 2: VEC{Results1; {VEC|PAR}; REG{Results2}}
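In software terms, the handoff between two such configurations may be pictured as two routines that communicate exclusively through the result memory; a hedged sketch (all names invented, the "memories" modeled as plain arrays):

#include <stdio.h>

#define N 4

/* Configuration 1: a combinatory network over the operand memory;
   the results are written to results1 (the REG{Results1} above). */
static void configuration1(const int operands[N], int results1[N]) {
    for (int k = 0; k < N; k++)
        results1[k] = operands[k] * 2;
}

/* Configuration 2: reads results1 as its operands, fills results2. */
static void configuration2(const int results1[N], int results2[N]) {
    for (int k = 0; k < N; k++)
        results2[k] = results1[k] + 1;
}

int main(void) {
    int operands[N] = { 1, 2, 3, 4 };
    int results1[N], results2[N];

    configuration1(operands, results1);  /* point in time t     */
    configuration2(results1, results2);  /* point in time t + 1 */

    for (int k = 0; k < N; k++)
        printf("%d\n", results2[k]);
    return 0;
}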
For the sake of simplicity, the above examples and descriptions have introduced the constructs VEC, PAR and REG into the high-level language and structured them accordingly. Typically and preferably, this structuring is only performed at the level of the intermediate language (see Principles of Compiler Design (Red Dragon), Aho, Sethi, Ullman).
It should be pointed out in particular that the structuring of algorithms with VEC, PAR and REG is typically performed completely automatically by the compiler, by methods such as graph analysis, for example.
In particular, however, it is also conceivable and to some extent advantageous to allow the programmer himself to perform the structuring in the high-level language, so that VEC, PAR and REG are directly writable in the high-level language as described above.
Automatic generation of VEC, PAR and REG may be performed at different levels of a compilation operation. The initially most enlightening one is performed during a preprocessor run on the basis of the source code, as in the preceding examples. However, a specially adapted compiler is then necessary for further compilation.
Another aspect is that compilers usually perform automatic optimization of code (e.g., in loops). Therefore, an efficient breakdown of the code is reasonable only after the optimization runs, in particular when compilers (such as SUIF, Stanford University) are already optimizing the code for parallelization and/or vectorization.
The particularly preferred method is therefore to tie the analyses into the back end of a compiler. The back end translates a compiler-internal data structure into the commands of the target processor.
Pointer structures such as DAGs/GRGs, trees or 3-address codes are usually used as the compiler-internal data structures (see Principles of Compiler Design (Red Dragon), Aho, Sethi, Ullman). To some extent, stack machine codes are also used (see "Compiler self-tailored," c't, 1986, 1-5). Since the data formats are equivalent in principle and may be transformed into one another, the preferred method according to the present invention is based on the further processing of graphs, preferably trees.
Data dependencies and possible parallelisms corresponding to the method described above may be discerned automatically within trees simply on the basis of the structure. To do so, known and established methods of graph analysis may be used, for example. Alternatively or optionally, an algorithm may be analyzed for data dependencies, loops, jumps, etc. through appropriately adapted parsing methods. A method similar to the one used for analyzing terms in compilers may be used.
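For 3-address codes, for instance, this structural dependence discovery reduces to remembering, per variable, which instruction last wrote it; an edge from that instruction to every later reader reproduces the graph analyzed here. The following sketch is an illustrative toy only; the instruction format and all names are invented.

#include <stdio.h>
#include <string.h>

/* Toy 3-address instruction: dst = src1 op src2. */
typedef struct { const char *dst, *src1, *src2; } TAC;

/* Index of the last instruction before 'upto' that wrote 'var',
   or -1 if there is none. */
static int last_writer(const TAC *code, int upto, const char *var) {
    for (int i = upto - 1; i >= 0; i--)
        if (strcmp(code[i].dst, var) == 0)
            return i;
    return -1;
}

int main(void) {
    /* Example A again: i++; a = i*b; x = p-a; j = i*i */
    TAC code[] = {
        { "i", "i", "1" },  /* 0: i = i + 1 */
        { "a", "i", "b" },  /* 1: a = i * b */
        { "x", "p", "a" },  /* 2: x = p - a */
        { "j", "i", "i" },  /* 3: j = i * i */
    };
    int n = (int)(sizeof code / sizeof code[0]);

    /* Print one dependence edge per read of a previously written value. */
    for (int i = 0; i < n; i++) {
        const char *srcs[2] = { code[i].src1, code[i].src2 };
        for (int s = 0; s < 2; s++) {
            int w = last_writer(code, i, srcs[s]);
            if (w >= 0)
                printf("%d -> %d (via %s)\n", w, i, srcs[s]);
        }
    }
    return 0;
}

The output shows instruction 3 depending only on instruction 0, which is precisely the parallelism exploited by the PAR{} of example A.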
Mapping
Further transformation of the algorithm now depends greatly on the target architecture. For example, the present applicant's processor architecture (VPU, XPP) offers automatic data synchronization in the hardware. This means that the correct chronological sequence of data dependencies is automatically handled in the hardware. Other architectures in some cases additionally require the synthesis of suitable state machines for controlling the data transfer.
Handling of conditional jumps is particularly interesting. For example, the present applicant's processor architecture makes a plurality of mechanisms for mapping and execution available:

1. Reconfiguration of the processor or parts of the processor by a higher-level configuration unit (see the patent application(s) PACT01, 04, 05, 10, 13, 17).

2. Rolling the function out into the array of PAEs (see the patent application PACT08), where both possible branches of a comparison are mapped onto the array at the same time, for example.

3. Wave reconfiguration according to the patent application(s) PACT08, 13, 17: a token is given to the data which is to be processed differently, and the token selects the particular valid configuration.
It should be pointed out that mechanism number 1 is in general the case typically to be used. Mechanism 2 is already very complex with most technologies, or is not even implementable, and case 3 is so far known only from the VPU technology of the present applicant.
The execution method to be selected in each case will depend on the complexity of the algorithm, the required data throughput (performance) and the precise embodiment of the target processor (e.g., the number of PAEs).
Examples:
A simple comparison should compute the following:
if i < 0 then a=a*(-i) else a=a*i
Reconfiguration of the processor (mechanism 1) depending on the result of the comparison seems to be less appropriate. It is basically possible to roll both branches out into the array (mechanism 2). Depending on the result of the comparison, the PAEs computing either a=a*(-i) or a=a*i are triggered (see PACT08).
It is particularly efficient in terms of space to superimpose the two computations (mechanism 3), so that after the comparison the same PAEs continue to process the data regardless of the result, but the data is provided with a token which then selects either the function a=a*(-i) or a=a*i locally in the data processing PAEs which follow, as a function of the result of the comparison (see PACT08, 13, 17).
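In software, mechanism 3 may be pictured as a token that travels with every data word and selects the local function; the following hedged sketch models a PAE as a plain function and is not the applicant's implementation:

#include <stdio.h>

/* A data word travelling through the array together with its token,
   i.e., the result of the comparison i < 0 computed one stage earlier. */
typedef struct { int a; int i; int token; } Word;

/* One PAE: the same cell computes either a*(-i) or a*i, selected
   locally by the token - no reconfiguration is needed. */
static int pae_multiply(Word w) {
    return w.token ? w.a * -w.i : w.a * w.i;
}

int main(void) {
    Word stream[] = { { 3, -2, 0 }, { 3, 2, 0 } };
    for (int k = 0; k < 2; k++) {
        stream[k].token = stream[k].i < 0;       /* comparison stage  */
        printf("%d\n", pae_multiply(stream[k])); /* token-selected op */
    }
    return 0;
}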
According to mechanism 1, there is a globally relevant state because the complete following configuration depends on it. According to mechanisms 2 and 3, there is only a locally relevant state, because it is no longer needed beyond the computation, which is completely implemented.
In other words, the local or global relevance of states may also depend on the selected mapping onto the processor architecture. A state which is relevant beyond a configuration, and thus beyond the combinatory network of the finite automaton representing a configuration (i.e., needed by following finite automatons), may basically be considered as global. The diffuse terminology of the term combinatory network as used here should also be pointed out.
Instruction model of the resulting processor
According to the present invention, there is a processing model for reconfigurable processors which includes all the essential instructions:
Arithmetic/logic instructions are mapped directly into the
combinatory network.
Jumps (jump/call) are either rolled directly into the combinatory network or are implemented by reconfiguration.
Condition and control flow instructions (if, etc.) are either completely resolved and processed in the combinatory network or are relayed to a higher-level configuration unit which then executes a reconfiguration according to the resulting status.
Load/store operations are preferably mapped into separate configurations and are implemented by address generators similar to the known DMAs, which write internal memories (REG{}) into external memories by using address generators, or which load them from external memories and/or peripherals. However, they may also be configured and operated together with the data processing configuration.
Register move operations are implemented in the combinatory network by buses between the internal memories (REG{}).
Push/pop operations are implemented by separate configurations which optionally write certain internal registers in the combinatory network and/or the internal memories (REG{}) via address generators into external memories, or read them out of external memories, and are preferably executed before or after the actual data processing configurations.
5. Description of the figures
The following figures show implementation examples and exemplary embodiments of the compiler.
Figure 1a shows the design of an ordinary finite automaton in which a combinatory network (0101) is linked to a register (0102). Data may be sent directly to 0101 (0103) and 0102 (0104). Processing of a state as a function of the preceding state(s) is possible by feedback (0105) from the register to the combinatory network. The processing results are represented by 0106.
Figure 1b shows a representation of the finite automaton by a reconfigurable architecture according to PACT01 and PACT04 (PACT04, Figures 12 through 15). The combinatory network from Figure 1a (0101) is replaced (0101b) by a configuration of PAEs (0107). The register (0102) is implemented by a memory (0102b) which is capable of storing a plurality of cycles. The feedback according to 0105 takes place via 0105b. The inputs (0103b and/or 0104b) are equivalent to 0103 and 0104, respectively. Direct access to 0102b may be implemented via a bus through array 0101b, if necessary. Output 0106b is in turn equivalent to 0106.
Figure 2 shows the mapping of a finite automaton onto a reconfigurable architecture. 0201(x) represents the combinatory network (which may be designed as PAEs according to Figure 1b). There exist one or more memories for operands (0202) and one or more memories for results (0203). Additional data inputs according to 0103b, 0104b, 0106b are not shown for the sake of simplicity. An address generator (0204, 0205) is assigned to each of the memories.
Operand memory and results memory (0202, 0203) are physically or virtually interlinked in such a way that the results of a function and/or an operation may be used as operands of another one, for example, and/or both results and newly added operands of one function may be used as operands for another function. Such a coupling may be established, for example, by bus systems or by (re)configuration through which the function and the networking of the memories with 0201 are newly configured.
Figure 3 shows different aspects for the handling of variables. In Figure 3a, 0301, 0302 and 0303 indicate different stages of computation. These stages may be purely combinatory or they may be separated via registers, where f1, f2, f3 are functions and x1 is a variable according to the patent description.
Figure 3b shows the use of a variable x1 in the function
x1 := x1 + 1.
Figure 3c shows the behavior of a finite automaton for the computation of x2 := x1 + 1 within a configuration. In the next configuration, 0306 and 0304 are to be exchanged in order to obtain a complete finite automaton. 0305 represents the address generators for memories 0304 and 0306.
Figure 4 shows implementations of loops. The modules shown hatched may be generated by macros (0420, 0421). 0421 may be inserted by analysis of the graph in the case of undelayed feedback.
Figure 4a shows the implementation of a simple loop of the type
WHILE TRUE DO
x1 := x1 + 1;
At the core of the loop is counter +1 (0401). 0402 is a multiplexer which at the beginning carries the starting value of x1 (0403) to 0401 and then causes feedback (0404a, 0404b) with each iteration. A register (see REG{}) (0405) serves to prevent undelayed and thus uncontrolled feedback of the output of 0401 to its input. 0405 is pulsed with the working clock pulse of the VPU and thus determines the number of iterations per unit of time. The particular counter content could be picked up at 0404a or 0404b. Depending on the definition of the high-level language, however, the loop does not terminate. For example, the signal at 0404 would be useful in an HDL according to the related art (e.g., VHDL, Verilog), whereas in a sequential programming language (e.g., C) 0404 is not usable, since the loop does not terminate and thus does not yield any exit value.
Multiplexer 0402 implements a macro formed from the loop construct. This macro is instantiated by the translation of WHILE.
Register 0405 is either also part of the macro or is inserted, according to a graph analysis according to the related art, exactly at the location and time where there is undelayed feedback, in order to thus rule out the tendency to oscillate.
Figure 4b shows the design of a true loop of the type
WHILE x1 < 10 DO
x1 := x1 + 1;
This design corresponds essentially to Figure 4a, which is why the same references have been used.
In addition, a circuit is shown which checks the validity of the results (0410) and relays the signal of 0404a to the downstream functions (0411) only when the abortion criterion of the loop has been reached. The abortion criterion is determined by the comparison x1 < 10 (comparison stage 0412). As a result of the comparison, the particular status flag (0413) is sent to multiplexer means 0402 for controlling the loop and to function 0411 for controlling the forwarding of results. Status flag 0413 may be implemented, for example, by triggers according to DE 197 04 728.9. Likewise, status flag means 0413 may be sent to a CT which then recognizes that the loop has been terminated and performs a reconfiguration.
Figure 5a shows the iterative computation of
FOR i := 1 TO 10
x1 := x1 * x1;
Essentially, the basic function corresponds to that in Figure 4b, which is why the same references have also been used here. Function block 0501 computes the multiplication. The FOR loop is then implemented by another loop according to Figure 4b and is indicated only by block 0503. Block 0503 supplies the state of the comparison of the abortion criterion. The state is used directly for triggering the iteration, thus largely eliminating means 0412 (represented by 0502).
Figure 5b shows the rolling out of the computation of
FOR i := 1 TO 10
x1 := x1 * x1;
Since the number of iterations is known exactly at translation time, the computation may be mapped into a sequence of i multipliers (0510).
Figure 6 shows the execution of a WHILE loop according to Figure 4b over several configurations. The state of the loop (0601) here is a relevant state because it has a significant influence on the function in the subsequent configurations. The computation extends over four configurations (0602, 0603, 0604, 0605). Between the configurations, the data is stored in memories (see REG{}) (0606, 0607). 0607 also replaces 0405.

The filling level of the memories may be used as a reconfiguration criterion, as indicated by 0606, 0607: memory full/empty, and/or by 0601, which indicates abortion of the loop. In other words, the filling level of the memories generates triggers (see PACT01, PACT05, PACT08, PACT10) which are sent to the configuration unit and trigger a reconfiguration. The state of the loop (0601) may also be sent to the CT. The CT may then configure the subsequent algorithms on reaching the abortion criterion and/or may first process the remaining parts of the loop (0603, 0604, 0605) if necessary and then load the subsequent configurations.
6. Parallelizability
Figure 6 shows potential limits to parallelizability.
If the computation of the operands is independent of feedback 0608, the loop may be computed in blocks, i.e., by filling memory 0606/0607 each time. This achieves a high degree of parallelism.
If the computation of an operand depends on the result of the previous computation, i.e., feedback 0608 or the like enters into the computation, the method is less efficient because it is possible to compute only one operand within the loop.
If the usable ILP (instruction level parallelism) within the loop is high and the time for reconfiguration is low (see PACT02, PACT04, PACT13, PACT17), a computation rolled out onto PAEs may also be efficient on a VPU.
If this is not the case, it is appropriate to map the loop onto a sequential architecture (a processor separate from the PA, or an implementation within the PA according to DE 196 51 075.9-53, DE 196 54 846.2-53 and, in particular, DE 199 26 538.0 (Figures 5, 11, 16, 17, 23, 30, 31, 33)).
Analysis of the computation times may be performed in the compiler at translation time according to the following section, and/or it may be measured empirically at the same or a later run time to induce a subsequent optimization, which results in a compiler capable of learning, in particular a self-learning compiler.
Analytical methods and parallelization methods are important for the present invention.

Various methods according to the related art are available for analysis and implementation of the parallelization. A preferred method is described below.
Functions to be mapped are represented by graphs (see PACT13; DE 199 26 538.0); an application may be composed of any number of different functions. The graphs are analyzed for the parallelism they contain; all methods of optimization may be used in advance.
For example, the following tests are to be performed:
ILP expresses which instructions may be executed simultaneously (see PAR{}). Such an analysis is easily possible on the basis of the consideration of the dependencies of nodes in a graph. Corresponding methods are sufficiently well known from the related art and mathematics per se; reference shall be made to VLIW compilers and synthesis tools as examples.
However, conditional executions (IF), possibly interlinked, deserve special attention, because a correct statement regarding the paths to be executed in parallel is often impossible or hardly possible: there is a strong dependence on the value space of the individual parameters, which is often known only inadequately or not at all. Furthermore, an exact analysis may take so much computation time that it may no longer reasonably be performed.
In such cases, the analysis may be simplified by instructions from the programmer, for example, and/or it is possible to operate on the basis of corresponding compiler switches so that, in case of doubt, either high parallelizability is assumed (possibly squandering resources) or low parallelizability is assumed (possibly squandering performance).
Likewise, in these cases an empirical analysis may be performed at run time. According to PACT10, PACT17, methods are known which make it possible to compile statistics regarding the program behavior at run time. It is thus possible to initially assume maximum parallelizability, for example. The individual paths return messages regarding each run cycle to a statistics unit (e.g., implemented in a CT or some other stage; see PACT10, PACT17, but basically units according to PACT04 may also be used). Using statistical measures, it is now possible to analyze which paths are actually run through. In addition, there is also the possibility of evaluating which paths are run through in parallel frequently, rarely or never, based on the data at run time. This type of path usage message is not obligatory but is advantageous.
Accordingly, the execution may be optimized in a next program call. It should be pointed out that the statistics information may in particular also be written away in a nonvolatile manner, as on a hard drive. It is known from PACT22, PACT24 that a plurality of configurations may either be configured simultaneously and then triggered by triggers (PACT06), or only a subset may be configured, the remaining configurations being reloaded as necessary by sending the corresponding triggers to a loading unit (CT, PACT10).
The value PAR(p) which is used below indicates, for the purpose of clarification, which parallelism is achievable on the instruction level, i.e., how many ILPs may be achieved at a certain stage (p) within the data flow graph transformed from the function (Figure 7a).
Vector parallelism is also important (see VEC{}). Vector parallelism is useful when large volumes of data must be processed. In this case, linear sequences of operations are vectorizable, i.e., all operations may process data at the same time, typically with each separate operation processing a separate data word.
This procedure is impossible to some extent within loops. Therefore, analyses and optimizations are necessary. For example, the graph of a function may be expressed by a Petri network. Petri networks have the property that the forwarding of results is controlled by nodes, so that loops may be modeled, for example. The data throughput is determined by the feedback of the result in a loop. Examples:
• The result of computation n is needed for computation n + 1: only one computation may be executed within the loop.

• The result of computation n is needed for computation n + m: m - 1 computations may be executed within the loop (see the example after this list).

• The result determines the abort of the loop, but it does not enter into the computation of the results: no feedback is necessary. Although incorrect values (too many values) may enter the loop under some circumstances, the output of results may be interrupted directly on reaching the final condition at the end of the loop.
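The second case above corresponds to a dependence distance of m in conventional loop terminology; a hedged C illustration with m = 4 (the array and the bounds are invented):

#include <stdio.h>

#define M 4  /* dependence distance: a[n] feeds a[n + M] */

int main(void) {
    int a[32];
    for (int n = 0; n < M; n++)
        a[n] = n + 1;            /* preamble: start values */

    /* Each iteration needs the result produced M steps earlier; the
       m - 1 iterations in between do not depend on it and may thus
       be in flight within the loop at the same time. */
    for (int n = M; n < 32; n++)
        a[n] = a[n - M] * 2;

    printf("%d\n", a[31]);
    return 0;
}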
Before the analysis of loops, they may be optimized according to the related art. For example, certain instructions may be extracted from the loop and placed before or after the loop.
The value VEC, which is used below for clarification, may illustrate the degree of vectorizability of a function. In other words, VEC indicates how many data words may be processed simultaneously within a quantity of operations. For example, VEC may be computed from the number of arithmetic means n_nodes required for a function and the data n_calc calculable within the vector at the same time, e.g., by VEC = n_calc / n_nodes.
If a function may be mapped onto five arithmetic means (n_nodes = 5), for example, and if data may be processed at the same time in each of the arithmetic means (n_calc = 5), then VEC = 1 (Figure 7b).
However, if a function may be mapped onto five arithmetic means (n_nodes = 5) and data may be processed in only one arithmetic means at a time, e.g., based on the feedback of the results of the pipeline back to the input (n_calc = 1), then VEC = 1/5 (Figure 7c).
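Stated as code, the two cases differ only in n_calc; a trivial sketch of the formula, with the names taken from the text:

#include <stdio.h>

/* VEC = n_calc / n_nodes: data words processed simultaneously
   within the quantity of operations (see the text above). */
static double vec(int n_calc, int n_nodes) {
    return (double)n_calc / (double)n_nodes;
}

int main(void) {
    printf("Figure 7b: VEC = %.2f\n", vec(5, 5)); /* 1.00            */
    printf("Figure 7c: VEC = %.2f\n", vec(1, 5)); /* 0.20, i.e., 1/5 */
    return 0;
}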
VEC may be computed for an entire function and/or for partial details of a function. For the compiler according to the present invention, both variants may be advantageous, as is the case in general for the determination and analysis of VEC.
According to Figure 7a, PAR(p) is determined for each line of a graph, as is possible in an advantageous manner. A line of a graph is defined in that it is executed within one clock unit. The number of operations depends on the implementation of the particular VPU.
If PAR(p) corresponds to the number of nodes in line p, then
all the nodes may be executed in parallel.
If PAR(p) is smaller, certain nodes are executed only alternatively. The alternative executions of one node are each combined in one PAE. A selection device permits activation of the alternative corresponding to the status of data processing at run time, as described in PACT08, for example.
VEC is also assigned to each line of a graph. If VEC = 1 for one line, this means that the line remains as a pipeline stage. If VEC is smaller than 1 for one line, then all the following lines which are also smaller than 1 are combined, because pipelining is impossible. According to the sequence of operations, they are combined to yield a sequence which is then configured into a PAE and is processed sequentially at run time. Corresponding methods are known from PCT/DE 97/02949 and/or PCT/DE 97/02998, for example.
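The line-by-line decision just described (keep a line with VEC = 1 as a pipeline stage, merge a run of lines with VEC < 1 into one sequencer) may be sketched as follows; the per-line analysis values are invented for illustration:

#include <stdio.h>

/* PAR(p) and VEC per line of the data flow graph. */
typedef struct { int par; double vec; } Line;

int main(void) {
    /* Hypothetical analysis result for a five-line graph. */
    Line lines[] = { {1, 1.0}, {2, 1.0}, {5, 0.4}, {3, 0.5}, {1, 1.0} };
    int n = (int)(sizeof lines / sizeof lines[0]);

    for (int p = 0; p < n; p++) {
        if (lines[p].vec >= 1.0) {
            printf("line %d: pipeline stage, %d PAE(s)\n", p, lines[p].par);
        } else {
            /* Combine this line and all following lines with VEC < 1
               into one sequence configured into a single PAE. */
            int start = p;
            while (p + 1 < n && lines[p + 1].vec < 1.0)
                p++;
            printf("lines %d-%d: combined into one sequencer\n", start, p);
        }
    }
    return 0;
}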
With the method described here, any complex parallel processor models may be constructed by groups of sequencers. In particular, sequencer structures may be generated for mapping reentrant code.
The synchronizations required to accomplish this may be implemented, for example, by the time stamp method described in PACT18 or, preferably, by the triggering method described in PACT08.
If a plurality of sequencers or sequential parts is mapped onto a PA, then it is preferable to coordinate the power of the individual sequencers for power consumption reasons. This may be accomplished in a particularly preferable manner by adapting the working frequencies of the sequencers to one another. For example, methods which allow individual clocking of individual PAEs or groups of PAEs are known from PACT25 and PACT18.
The frequency of a sequencer may be determined on the basis of the number of cycles typically needed for processing the function assigned to it.
For example, if a sequencer needs five clock cycles for processing its function while the remaining system needs exactly one clock cycle to process its assigned tasks, then its clocking should be five times higher than the clocking of the remaining system (a factor of 5). In the case of a plurality of sequencers, different clock cycles are possible. A clock multiplication and/or a clock division may also be provided.
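The clock adaptation reduces to a ratio of cycle counts; a one-function sketch under the stated assumption of a five-cycle sequencer in a one-cycle environment (names invented):

#include <stdio.h>

/* Sequencer clock = system clock * (cycles the sequencer needs per
   datum / cycles the remaining system needs); here 5/1, factor 5. */
static long sequencer_hz(long system_hz, int seq_cycles, int sys_cycles) {
    return system_hz * seq_cycles / sys_cycles;
}

int main(void) {
    printf("%ld\n", sequencer_hz(100000000L, 5, 1)); /* 500000000 */
    return 0;
}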
Functions are partitioned according to the aforementioned method. In partitioning, memories for data and relevant status are inserted accordingly. Additional alternative and/or more extensive methods are known from PACT13 and PACT16.
According to PACT01, PACT10, PACT13, PACT17, PACT22, PACT24, some VPUs offer the possibility of differential reconfiguration. This may be used if only relatively few changes are necessary within the configuration of PAEs in a reconfiguration. In other words, only the changes of a configuration in comparison with the instantaneous configuration are reconfigured. The partitioning in this case may be such that the (differential) configuration following a configuration contains only the required reconfiguration data and does not constitute a complete configuration. The compiler of the present invention is preferably designed to recognize and support this.
Scheduling of the reconfiguration may take place through the status which reports a function(s) to a loading unit (CT), which in turn selects, on the basis of the incoming status, the next configuration or partial configuration and configures it. In detail, such methods are known from PACT01, PACT05, PACT10, PACT13, PACT17.
In addition, the scheduling may support the possibility of preloading configurations during the run time of another configuration. A plurality of configurations may also be preloaded speculatively, i.e., without being sure that the configurations are needed at all. This is particularly preferred when the CT is at least largely unloaded and in particular has little or no task load, e.g., in the case of lengthy data streams that are processable without configuration. Due to selection mechanisms such as those described according to DE 197 04 728.9, for example, the configurations to be used are then selected at run time (see also the example NLS in PACT22/24).
Likewise, the local sequencers may be controlled by the status of their data processing, as is known, for example, from DE 196 51 075.9-53, DE 196 54 846.2-53, DE 199 26 538.0. To perform their reconfiguration, another dependent or independent status may be reported to the CT (see PACT04, LLBACK, for example).
The present invention will now be described with reference to additional figures, where the following characters are used below to simplify the notation: ∨ or, ∧ and.
Figure 8a shows the graph according to Figure 7a mapped onto a group of PAEs with the maximum achievable parallelism. All operations (instructions i1-i12) are mapped into individual PAEs.
Figure 8b shows the same graph, e.g., with maximum usable vectorizability, but the quantities of operations V2 = {i2, i3}, V3 = {i4, i5, i6, i7, i8}, V4 = {i9, i10, i11} are not parallel (PAR({2,3,4}) = 1). This makes it possible to save resources by assigning a quantity P2, P3, P4 of operations to one PAE. A status signal for each data word in each stage selects the operation to be executed in the particular PAE. The PAEs are networked as a pipeline (vector), and each PAE executes one operation on a different data word per clock cycle.
This yields the following sequence:

PAE1 computes the data and sends it to PAE2. Together with the data, it forwards a status signal which indicates whether i2 or i3 is to be executed.

PAE2 continues to compute the data of PAE1. According to the incoming status signal, the operation to be executed (i2 ∨ i3) is selected and computed. According to the computation, PAE2 forwards a status signal to PAE3, indicating whether (i4 ∨ i5) ∨ (i6 ∨ i7 ∨ i8) is to be executed.

PAE3 continues to compute the data of PAE2. According to the incoming status signal, the operation to be executed (i4 ∨ i5) ∨ (i6 ∨ i7 ∨ i8) is selected and computed. According to the computation, PAE3 forwards a status signal to PAE4, indicating whether i9 ∨ i10 ∨ i11 is to be executed.

PAE4 continues to compute the data of PAE3. According to the incoming status signal, the operation to be executed (i9 ∨ i10 ∨ i11) is selected and computed.

PAE5 continues to compute the data of PAE4.
One possible corresponding method, and hardware which allows a particularly favorable implementation of the method described here, are described in DE 197 04 728.9 (Figures 5 and 6). PACT04 and PACT10, PACT13 also describe methods that may be used in general but are more complex.
Figure 8c again shows the same graph. In this example, vectorization is impossible, but PAR(p) is high, which means that within one line a plurality of operations may be executed at the same time. The operations to be implemented in parallel are P2 = {i2 ∧ i3}, P3 = {i4 ∧ i5 ∧ i6 ∧ i7 ∧ i8}, P4 = {i9 ∧ i10 ∧ i11}. The PAEs are networked so that they may exchange any data among themselves in any way. The individual PAEs perform operations only when there is an ILP in the corresponding cycle; otherwise they behave neutrally (NOP), and there may optionally be a clocking down and/or a clock shutdown and/or a current shutdown to minimize power loss.
The following sequence is provided:

In the first cycle, only PAE2 operates and forwards the data to PAE2 and PAE3.

In the second cycle, PAE2 and PAE3 are operating in parallel and forward their data to PAE1, PAE2, PAE3, PAE4, PAE5.

In the third cycle, PAE1, PAE2, PAE3, PAE4 and PAE5 are operating and forward the data to PAE2, PAE3 and PAE5.

In the fourth cycle, PAE2, PAE3 and PAE5 are operating and forward the data to PAE2.

In the fifth cycle, only PAE2 is operating.
This function thus requires five cycles for the computation. The corresponding sequencer should thus operate with a five-fold clock cycle in relation to its environment in order to achieve a corresponding performance.
A possible corresponding method is described in PACT02 (Figures 19, 20 and 21). PACT04 and PACT10, PACT13 also describe methods that may be used in general but are more complex. Additional methods and/or hardware may also be used.
Figure 8d shows the graph according to Figure 7a in the case that there is no usable parallelism. To compute a data word, each stage must be run through in succession. Within the stages, only one of the branches is processed at any time.
The function also requires five cycles for the computation: cy1 = (i1), cy2 = (i2 ∨ i3), cy3 = (i4 ∨ i5 ∨ i6 ∨ i7 ∨ i8), cy4 = (i9 ∨ i10 ∨ i11), cy5 = (i12). Thus, the corresponding sequencer should work with a clock cycle five times faster in relation to its environment to achieve a corresponding performance.
Such a function is mappable, for example, as in Figure 8c by using a simple sequencer according to PACT02 (Figures 19, 20 and 21). PACT04 and PACT10, PACT13 also describe methods that may be used in general but are more complex.
The illustrations represented in Figure 8 may be combined and
grouped at will.
In Figure 9a, for example, the same function is shown, in which the paths (i2 ∧ (i4 ∨ i5) ∧ i9) and (i3 ∧ (i6 ∨ i7 ∨ i8) ∧ (i10 ∨ i11)) are executable in parallel. (i4 ∨ i5), (i6 ∨ i7 ∨ i8) and (i10 ∨ i11) are each alternatives. The function is also vectorizable. It is thus possible to construct a pipeline in which the particular function to be executed is determined on the basis of status signals for three PAEs (PAE4, PAE5, PAE7).
Figure 9b shows a similar example in which vectorization is impossible. However, the paths (i1 ∧ i2 ∧ (i4 ∨ i5) ∧ i9 ∧ i12) and (i3 ∧ (i6 ∨ i7 ∨ i8) ∧ (i10 ∨ i11)) are parallel. Thus, the optimum performance may be achieved with the use of two PAEs which also process the parallel paths in parallel. The PAEs are synchronized with one another by status signals, which are preferably generated by PAE1 because it computes the start (i1) and the end (i12) of the function.
It should be pointed out in particular that a symmetrically parallel processor model (SMP) or the like could yield the multiprocessor models in use today by a multiple configuration of sequencers.
It should also be pointed out that all configuration registers for scheduling may also be loaded with new configurations in the background and/or during data processing.
This is possible, for example, if the hardware has the design known from DE 196 51 075.9-53. Independent memory areas or registers are then available and may be addressed independently. It is possible to jump to certain locations due to incoming triggers; jump instructions (JMP, CALL/RET) are also possible and may be executed conditionally, if necessary.
According to DE 196 54 846.2-53, independent read and write pointers are available, so that basically there is an independence and thus the possibility of access in the background. In particular, it is possible to segment the memories, which provides additional independence. Jumps are possible via jump instructions (JMP, CALL/RET), which may also be executed conditionally if necessary.
According to DE 197 04 728.9, the individual registers which may be selected by the triggers are basically independent and therefore allow an independent configuration, in particular in the background. Jumps within the registers are not possible, and the choice is made exclusively via the trigger vectors.
An essential factor for evaluating the efficiency of PAR and VEC is the type of data processed by the particular structure. For example, it is worthwhile to roll out a structure, i.e., to pipeline or parallelize it, if that structure processes a large volume of data, as is the case with video data or telecommunications data, for example. It is not worthwhile to roll out structures which process low data volumes (e.g., keyboard input, mouse, etc.); on the contrary, they would block resources for other algorithms.
Thus, on the basis of different instructions, it is proposed that only those algorithms, structures or parts of algorithms which appropriately process large volumes of data shall be parallelized and vectorized.
Such instructions might be, for example:
1. The data type (on the basis of their higher data volume, arrays and streams, for example, should be rolled out sooner than individual characters).

2. The type of access (linear program sequences should be mapped into sequencers, for example, whereas it is worthwhile to roll out loops because of their high number of run-throughs).

3. The type of source and/or target (keyboard and mouse, for example, have a data rate that is too low to be rolled out efficiently, whereas the data rate of a network, for example, and/or of video sources or targets is much higher).
Any quantity of these instructions may be used for analysis.
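Combined into a single predicate, such an analysis might look as follows; the thresholds, type names and the classification are invented for illustration and are not prescribed by the text:

#include <stdbool.h>

typedef enum { SRC_KEYBOARD, SRC_MOUSE, SRC_NETWORK, SRC_VIDEO } Source;

typedef struct {
    bool   is_array_or_stream;  /* instruction 1: data type        */
    bool   is_loop;             /* instruction 2: type of access   */
    Source source;              /* instruction 3: source or target */
    long   bytes_per_second;    /* estimated or measured data rate */
} CodeRegion;

/* Roll out (parallelize/vectorize) only structures that move enough
   data to pay for the configured resources; map the rest onto
   sequencers. */
static bool should_roll_out(const CodeRegion *r) {
    if (r->source == SRC_KEYBOARD || r->source == SRC_MOUSE)
        return false;                      /* data rate too low          */
    if (!r->is_array_or_stream && !r->is_loop)
        return false;                      /* linear sequence: sequencer */
    return r->bytes_per_second > 1000000;  /* invented threshold         */
}

int main(void) {
    CodeRegion video = { true, true, SRC_VIDEO, 30000000L };
    return should_roll_out(&video) ? 0 : 1;  /* exits 0: roll out */
}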
7. Definition of Terms
Locally relevant state: A state that is relevant only within a certain configuration.

Globally relevant state: A state that is relevant in multiple configurations and must be exchanged among configurations.

Relevant state: A state that is needed within an algorithm for its correct execution and is thus described by the algorithm and used by it.

Irrelevant state: A state that is not important for the actual algorithm and is also not described in the algorithm, but is needed by the executing hardware, depending on the implementation.
Administrative Status


Event History

Description Date
Inactive: IPC expired 2020-01-01
Inactive: IPC expired 2018-01-01
Inactive: IPC expired 2018-01-01
Inactive: Dead - No reply to Office letter 2006-05-15
Application Not Reinstated by Deadline 2006-05-15
Inactive: IPC from MCD 2006-03-12
Inactive: IPC from MCD 2006-03-12
Inactive: IPC from MCD 2006-03-12
Inactive: IPC from MCD 2006-03-12
Deemed Abandoned - Failure to Respond to Maintenance Fee Notice 2005-08-16
Inactive: Status info is complete as of Log entry date 2005-06-29
Inactive: Abandoned - No reply to Office letter 2005-05-16
Inactive: Courtesy letter - Evidence 2004-06-08
Inactive: Cover page published 2004-06-07
Inactive: Notice - National entry - No RFE 2004-06-03
Inactive: First IPC assigned 2004-06-03
Application Received - PCT 2004-03-23
National Entry Requirements Determined Compliant 2004-02-13
Application Published (Open to Public Inspection) 2003-02-27

Abandonment History

Abandonment Date Reason Reinstatement Date
2005-08-16

Maintenance Fee

The last payment was received on 2004-02-13

Note : If the full payment has not been received on or before the date indicated, a further fee may be required which may be one of the following

  • the reinstatement fee;
  • the late payment fee; or
  • additional fee to reverse deemed expiry.

Please refer to the CIPO Patent Fees web page to see all current fee amounts.

Fee History

Fee Type Anniversary Year Due Date Paid Date
Basic national fee - standard 2004-02-13
MF (application, 2nd anniv.) - standard 02 2004-08-16 2004-02-13
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
PACT XPP TECHNOLOGIES AG
Past Owners on Record
ARMIN NUECKEL
FRANK MAY
MARTIN VORBACH
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents

Document Description / Date (yyyy-mm-dd) / Number of pages / Size of Image (KB)
Description 2004-02-13 64 2,592
Claims 2004-02-13 5 191
Abstract 2004-02-13 1 6
Cover Page 2004-06-07 2 32
Drawings 2004-02-13 13 788
Notice of National Entry 2004-06-03 1 192
Request for evidence or missing transfer 2005-02-15 1 101
Courtesy - Abandonment Letter (Office letter) 2005-06-27 1 166
Courtesy - Abandonment Letter (Maintenance Fee) 2005-10-11 1 176
PCT 2004-02-13 3 166
Correspondence 2004-06-03 1 26