Sign in

username:

password:



Not a member?

Search fpga-cpu



Search tips

Subscribe to fpga-cpu



fpga-cpu by Keywords

Altera | CISCifying | IDE | ISA | Java | JHDL | JTAG | LBU | MicroBlaze | PAR | PCI | RISC | SoC | Spartan | Transputers | Verilog | VHDL | Virtex | VLIW | WebPack | Xilinx | Xsoc | YARD-1A

Discussion Groups

Discussion Groups | FPGA-CPU | Re: VLIW processors anyone ?

This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).

VLIW processors anyone ? - rtfinch35 - Mar 11 8:42:00 2002


I want to do a VLIW processor in an FPGA, but I'm not confident what
I'm doing. Does anyone have links to sample (educational) VLIW
processor designs ? I've searched the net but can't seem to find
anything.

Current design: http://www.birdcomputer.ca/blackbird.htm

Thanks
Rob





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: VLIW processors anyone ? - Ben Franchuk - Mar 11 11:10:00 2002

rtfinch35 wrote:
>
> I want to do a VLIW processor in an FPGA, but I'm not confident what
> I'm doing. Does anyone have links to sample (educational) VLIW
> processor designs ? I've searched the net but can't seem to find
> anything.
>
> Current design: http://www.birdcomputer.ca/blackbird.htm

Why not a micro-code designed cpu? Since internal memory is faster than
external memory why not go back to CISC's style architecture rather than
say a RISC emulator. The instruction set could be raw micro-code (like
risc)
or one of 3 user configured tables. A fourth table is used for MMU/OS
stuff.
Here would be a nice CPU to emulate.
http://www.spies.com/~aek/alto/index.html
--
Ben Franchuk - Dawn * 12/24 bit cpu *
www.jetnet.ab.ca/users/bfranchuk/index.html





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

RE: VLIW processors anyone ? - Jan Gray - Mar 11 11:29:00 2002

> I want to do a VLIW processor in an FPGA, but I'm not confident what
> I'm doing. Does anyone have links to sample (educational) VLIW
> processor designs ? I've searched the net but can't seem to find
> anything.

INTRODUCTION TO VLIW

VLIW ::= very long instruction word: a machine with one instruction
pointer (how's that, Reinoud? :-) ) that selects a long instruction word
that issues a plurality of operations to a plurality of function units
each cycle. In a VLIW system, the compiler schedules the multiple
operations into instructions; in contrast, in a superscalar RISC, the
instruction issue hardware schedules one or more scalar instructions to
issue at the same time. You may also see the term LIW vs. VLIW. I
think of an LIW as a 2 or 3 issue machine, a VLIW as a 3+ issue machine.
In some VLIWs, to improve code density, a variable number of operations
can be issued each cycle.

There's a rich literature on VLIWs (that I don't purport to have read),
but be sure to see: John R. Ellis, Bulldog: A Compiler for VLIW
Architectures.

The IA64 EPIC "explicitly parallel instruction computing" architecture
is the most notable VLIW derivative. Several DSPs like the TI C6x are
LIWs. A famous early VLIW was the Multiflow Trace family, a
supercomputer startup that lived approximately 1984-1990.

The challenging part of a VLIW project is the compiler. Unless you're a
glutton for tricky assembly programming, or have a very small and
specific problem, it's hardly worth designing the hardware if you don't
have a way to compile code to use the parallelism presented by the
hardware. Indeed, some LIW design suites (IIRC the Philips LIFE) allow
you to pour C code through a compiler and simulator and configure the
number and kind of function units to best trade off performance and area
against the specific computation kernel you care about.
Anyway, on to FPGA VLIW CPU issues. Here are some quick comments.

To summarize my opinions, is a 2-issue machine is straightforward and
even worthwhile, whereas (depending upon computation kernel) a 4+ issue
machine may well bog down in its multiplexers and therefore may not make
the best use of FPGA resources.

INSTRUCTION FETCH AND ISSUE

No sweat. Store instructions in a block RAM based instruction cache.
Each BRAM in Virtex derived architectures can read 2x16=32 bits each
cycle; each BRAM in Virtex-II can read 2x36=72 bits each cycle. Use 2-4
BRAMs and you can easily fetch 100-200 bits of instruction word each
cycle.

Keeping the I-cache fed is another matter, left as an exercise.

REGISTER FILE DESIGN

The key design problem is the register file organization. If you are
going to issue n operations each cycle, you must fetch 1-2*n operands
and retire up to n results each cycle. That implies a 1-2*n-read
n-write per cycle register file.

As a rule of thumb, in a modern LUT based FPGA, if a pipelined ALU
operation requires about T ns, then you can read or write to a single
port LUT RAM (or read AND write to a dual port LUT RAM) in about T/2 ns.
(In Virtex-II, T is about 5 ns).

Let's discuss an n=4 operation machine. It is a challenge to retire 4
results to a LUT RAM based RF in T ns. Sure, if the cycle time balloons
to 2T, you can sequentially retire all four results in 4*T/2 ns, and
read operands in parallel using dual port LUT RAM.

Thus in a fully general organization, where any of the four results can
retire to any four registers, "it can't be done" in T ns as defined
above.

Instead, consider a partitioned register file design. For instance,
imagine a 64-register machine partitioned into 4 banks of 16 registers
each. Each cycle, one result can be retired into each bank. That is
easily targeted to a LUT RAM implementation. Indeed, you can fix the
machine such that each issue slot is associated with one write port on
the register file.

Say bank m is assigned registers i, i div 4 == m, with 0<=i<64, 0<=m<4.
We can easily issue four independent instructions such as
r0=r1+r2; r16=r17+r18; r32=r33+r34; r48=r49+r50

One $64 design question, properly answerable only with a good compiler
at your side, is how to then arrange the read ports. At one extreme, if
each of the four banks is entirely isolated from the other, (like a
left-brain/right-brain patient with their corpus collosum cut), then you
only need 2 read ports on each bank. In this organization, if you need
a register or two from another bank, you would typically burn an issue
slot in that sourcing bank to read the registers, and maybe burn another
slot in the destination bank to save the registers. (Alternately in the
destination bank you can directly mux it into the dependent operation
register).

At the other extreme, if any of the four banks can with full generality
read operands from any of the other four banks, e.g.

r0=r1+r2; r16=r3+r4; r32=r5+r6; r48=r7+r8 // cycle 1
r0=r17+r18; r16=r19+r20; r32=r21+r22; r48=r23+r24; // cycle 2
r0=r33+r34; r16=r35+r36; r32=r37+r38; r48=r39+r40; // cycle 3
r0=r49+r50; r16=r51+r52; r32=r53+r54; r48=r55+r56 // cycle 4

Then each of the four banks would need to provide 8 read ports, and each
of the 8 operand multiplexers (in front of the four ALUs) would need at
least a 4-1 mux.

You can of course provide all these ports with replicating or
multi-cycling the register file LUT RAMs, but it won't be pretty.

In practice, I believe that the expected degree of cross-bank register
reading is limited. Maybe you typically only need 3 read ports per
bank, perhaps 1 1/2 for that bank's ALU and perhaps 1 1/2 for other
slots' accesses. But you have to have a scheduling compiler to help you
make these tradeoffs.

By the way, in my 1994 brain dump
(http://fpgacpu.org/usenet/homebrew.html), with the sketch of the NVLIW1
2-issue LIW, I used exactly this partitioned register file technique,
combining two banks of 3r-1w register files. For each issue slot, one
operand was required to be a register in that bank; but the other
operand could be a register in either bank.

Another option is to combine multibanked registers with some heavily
multiported registers. For instance, assume each issue slot can read or
write 16 bank registers and 16 shared registers, subject across issue
slots to some maximum number of shared register file read and write port
accesses each cycle. The designers of the Sun MAJC used a similar
approach.

EXECUTION UNIT

Assume a design with 4 ALUs, one branch unit, one memory port.

Loads/stores: Perhaps you limit generality and only allow loads/stores
to issue from a given slot (and then shuttle those results to other slot
register banks using the above multiported reg file description). The
general alternative, allowing any slot to issue loads/stores, requires
you to mux any ALU output to be the ea, and a per-bank mux to allow the
MDR input to be the result.

Result forwarding (register file bypass): if an operation in slot i,
cycle 1, is permitted with full generality, to consume the result of an
operation from slot j, cycle 0, then you need n copies of an n:1 result
forwarding mux. Again, this is very expensive, so you will be sorely
tempted to reduce the generality of result forwarding, or eliminate it
entirely. Again, this is a design tradeoff your compiler must help you
select.

CONTROL FLOW

You want to reduce the number of branches. In average C code you can
expect a branch every 5 (rule of thumb) generic RISC instructions or so.
In a 4 issue machine you will spend your life branching.

Instead, you will want to use some form of predicated execution. Some
computations will set predicate bits that represent the outcomes of
conditional tests. Then other instructions' execution (result
write-back and other side effects) will be predicated upon these
specified predicate bits.

In this organization, it seems reasonable to allow any issue slot to
establish predicates that can be used by any other issue slot; but for
simplicity you will only need and want one, or perhaps two, of the issue
slots to be able to issue (predicated) branch and jump instructions.

You will want a compiler to help you design the specific details of the
predication system.

IN SUMMARY

The sky's the limit. It is easy to build an 8- or 16- or 32-issue VLIW
in an FPGA. That's just stamping out more instances of the i-cache +
execution unit core slice. Whether the resulting machine runs much
faster than a much simpler 2-issue LIW is critically dependent upon your
compiler technology. ASIDE: WM

As an aside, I'll just briefly mention Wulf's WM machine
(http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with
many interesting features, but of particular interest here, each execute
instruction is of the form
rd = (rs1 op rs2) op rs3 ,
an explicitly parallel representation that affords some additional
parallelism over the canonical scalar RISC, but which does not increase
the number of write ports required on the register file.

Jan Gray
Gray Research LLC





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

RE: VLIW processors anyone ? - Jan Gray - Mar 11 11:48:00 2002

> Current design: http://www.birdcomputer.ca/blackbird.htm

I see this design reflects many of the issues in my last message. I
should read first, write second!

Jan Gray
Gray Research LLC




(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: VLIW processors anyone ? - rtfinch35 - Mar 11 21:46:00 2002

>
> The challenging part of a VLIW project is the compiler. Unless
you're a
> glutton for tricky assembly programming, or have a very small and
> specific problem, it's hardly worth designing the hardware if you
don't I'm planning on using the Transmeta approach. I can write a simple
virtual instruction set to real instruction set translater. At first
I'm aiming for a direct 1:1 translation that won't provide any
speedup (in fact it'll slow things down because of the wider
instruction word). As the translater gets better, performance will
improve. (Perhaps I could use the xr16 ISA as the virtual instruction
set? :) > Anyway, on to FPGA VLIW CPU issues. Here are some quick comments.
>
> To summarize my opinions, is a 2-issue machine is straightforward
and
> even worthwhile, whereas (depending upon computation kernel) a 4+
issue
> machine may well bog down in its multiplexers and therefore may not
make
> the best use of FPGA resources. I quickly put together a datapath and it looks like the datapath
takes about 1/3 of a SpartanII 2S200 and can run close to 50MHz, of
course this is without all the other extras needed to make it work,
and assuming I got it right (a pretty big assumption). The thing that
appears to slow it down is actually the barrel shifter which I may
get rid of by limiting shifts to 0 to 3 bits at once. I think I can
get a peak instruction completion rate of five instructions per cycle
(I admit a little contrived) but it makes for great marketing. (eg.
1) a compare (target = condition codes), 2) a store (no target reg),
3) an effective address calc (target = special addr reg) 4) a branch,
5) some alu operation 6) maybe ? Completing five instructions per
cycle at 50MHz would give me a 250 MIPs processor... Well I can dream
can't I ? > INSTRUCTION FETCH AND ISSUE
>
> No sweat. Store instructions in a block RAM based instruction >

Planning on using a 96 bit (or possibly 128 bit) wide i-cache, with a
32 bit memory interface. > and retire up to n results each cycle. That implies a 1-2*n-read
> n-write per cycle register file.

Um, I think you need about half as many write ports because only
about 50% of instructions actually update registers. > CONTROL FLOW
>
> Instead, you will want to use some form of predicated execution.

I was thinking of using branch prediction.

> ASIDE: WM
>
> As an aside, I'll just briefly mention Wulf's WM machine
> (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with

I tried to access some of the docs but the links seem to be broken :(

> instruction is of the form
> rd = (rs1 op rs2) op rs3 ,

If the operations are stacked like this, wouldn't it increase the
cycle time of the processor ?

Rob





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

RE: Re: VLIW processors anyone ? - Jan Gray - Mar 12 12:55:00 2002

> I'm planning on using the Transmeta approach. I can write a simple
> virtual instruction set to real instruction set translater. At first
> I'm aiming for a direct 1:1 translation that won't provide any
> speedup (in fact it'll slow things down because of the wider
> instruction word). As the translater gets better, performance will
> improve. (Perhaps I could use the xr16 ISA as the virtual instruction
> set? :)

Be our guest...

> I quickly put together a datapath and it looks like the datapath
> takes about 1/3 of a SpartanII 2S200 and can run close to 50MHz, of
> course this is without all the other extras needed to make it work,
> and assuming I got it right (a pretty big assumption). The thing that
> appears to slow it down is actually the barrel shifter which I may
> get rid of by limiting shifts to 0 to 3 bits at once. I think I can
> get a peak instruction completion rate of five instructions per cycle
> (I admit a little contrived) but it makes for great marketing. (eg.
> 1) a compare (target = condition codes), 2) a store (no target reg),
> 3) an effective address calc (target = special addr reg) 4) a branch,
> 5) some alu operation 6) maybe ? Completing five instructions per
> cycle at 50MHz would give me a 250 MIPs processor... Well I can dream
> can't I ?

What matters:
* wall clock time
* total energy consumed by the computation
* FPGA area
* dev tools ease of use.

What doesn't matter:
* aggregate MIPS

:-) See e.g. http://fpgacpu.org/usenet/array.html for a (useless) 10^13
ops/s machine.

> > and retire up to n results each cycle. That implies a 1-2*n-read
> > n-write per cycle register file.

> Um, I think you need about half as many write ports because only
> about 50% of instructions actually update registers.

Certainly but it depends upon whether and how you handle the worst case
cases, whether you have multiplexers between ALUs and write port(s),
etc.

> > CONTROL FLOW
> >
> > Instead, you will want to use some form of predicated execution.
>
> I was thinking of using branch prediction.

That will help. So you only fetch useless instructions on a mispredict?

> > ASIDE: WM
> >
> > As an aside, I'll just briefly mention Wulf's WM machine
> > (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with
>
> I tried to access some of the docs but the links seem to be broken :(

Hmm. I have the ACM SIGARCH paper on it around here somewhere. The two
big architectural ideas are the two ops per instruction and the implicit
streaming memory ports exposed as special purpose registers.

> > instruction is of the form
> > rd = (rs1 op rs2) op rs3 ,
>
> If the operations are stacked like this, wouldn't it increase the
> cycle time of the processor ?

No and no.

First, IIRC, WM does the second op in a second pipeline stage.

Second, as I first learned during the Sun SuperSPARC presentation at Hot
Chips III (1991) (I think it was), the delay through two ALUs (a+b)+c is
not necessarily the same as two times the delay through one ALU (a+b),
since there is concurrency -- lower bits of (a+b) can be added to lower
bits of c even as upper bits of (a+b) are still ripple-carry-settling.

<Pause while I go and run some tests.>

OK, here's are two quick and dirty test circuits, sans floorplanning:

module onealu(clk, i, o);
input clk, i;
output o;
reg o;

reg [31:0] a, b, sum;
reg z;

always @(posedge clk) begin
{a,b} <= {a,b,i};
sum <= a+b;
z <= sum == 0;
o <= z;
end
endmodule module twoalus(clk, i, o);
input clk, i;
output o;
reg o;

reg [31:0] a, b, c, sum;
reg z;

always @(posedge clk) begin
{a,b,c} <= {a,b,c,i};
sum <= (a+b)+c;
z <= sum == 0;
o <= z;
end
endmodule

'onealu' approximates the delay from pipeline registers, through an
adder, and back into pipeline registers.
'twoalu' approximates the same thing, but through two adders.

Here are the TRCE numbers for Virtex-II -5:

onealu:
Slack: -0.237ns (requirement - (data path - negative
clock skew))
Source: b[0]
Destination: sum[31]
Requirement: 4.500ns
Data Path Delay: 4.737ns (Levels of Logic = 18)
Negative Clock Skew: 0.000ns
Source Clock: clk_c rising at 0.000ns
Destination Clock: clk_c rising at 4.500ns
Timing Improvement Wizard
Data Path: b[0] to sum[31]
Location Delay type Delay(ns) Physical Resource
Logical
Resource(s)
-------------------------------------------------
-------------------
L3.IQ1 Tiockiq 0.598 i
b[0]
SLICE_X0Y0.F1 net (fanout=2) 0.516 b[0]
SLICE_X0Y0.COUT Topcyf 0.755 sum[0]
sum_1_axb_0
sum_1_cry_0
sum_1_cry_1
SLICE_X0Y1.CIN net (fanout=1) 0.000 sum_1_cry_1/O
SLICE_X0Y1.COUT Tbyp 0.092 sum[2]
SLICE_X0Y2.CIN net (fanout=1) 0.000 sum_1_cry_3/O
SLICE_X0Y2.COUT Tbyp 0.092 sum[4]
SLICE_X0Y3.CIN net (fanout=1) 0.000 sum_1_cry_5/O
SLICE_X0Y3.COUT Tbyp 0.092 sum[6]
SLICE_X0Y4.CIN net (fanout=1) 0.000 sum_1_cry_7/O
SLICE_X0Y4.COUT Tbyp 0.092 sum[8]
SLICE_X0Y5.CIN net (fanout=1) 0.000 sum_1_cry_9/O
SLICE_X0Y5.COUT Tbyp 0.092 sum[10]
SLICE_X0Y6.CIN net (fanout=1) 0.000 sum_1_cry_11/O
SLICE_X0Y6.COUT Tbyp 0.092 sum[12]
SLICE_X0Y7.CIN net (fanout=1) 0.000 sum_1_cry_13/O
SLICE_X0Y7.COUT Tbyp 0.092 sum[14]
SLICE_X0Y8.CIN net (fanout=1) 0.000 sum_1_cry_15/O
SLICE_X0Y8.COUT Tbyp 0.092 sum[16]
SLICE_X0Y9.CIN net (fanout=1) 0.000 sum_1_cry_17/O
SLICE_X0Y9.COUT Tbyp 0.092 sum[18]
SLICE_X0Y10.CIN net (fanout=1) 0.000 sum_1_cry_19/O
SLICE_X0Y10.COUT Tbyp 0.092 sum[20]
SLICE_X0Y11.CIN net (fanout=1) 0.000 sum_1_cry_21/O
SLICE_X0Y11.COUT Tbyp 0.092 sum[22]
SLICE_X0Y12.CIN net (fanout=1) 0.000 sum_1_cry_23/O
SLICE_X0Y12.COUT Tbyp 0.092 sum[24]
SLICE_X0Y13.CIN net (fanout=1) 0.000 sum_1_cry_25/O
SLICE_X0Y13.COUT Tbyp 0.092 sum[26]
SLICE_X0Y14.CIN net (fanout=1) 0.000 sum_1_cry_27/O
SLICE_X0Y14.COUT Tbyp 0.092 sum[28]
SLICE_X0Y15.CIN net (fanout=1) 0.000 sum_1_cry_29/O
SLICE_X0Y15.Y Tciny 1.257 sum[30]
sum_1_cry_30
sum_1_s_31
SLICE_X0Y15.DY net (fanout=1) 0.001 sum_1[31]
SLICE_X0Y15.CLK Tdyck 0.322 sum[30]
sum[31]
-------------------------------------------------
---------------------------
Total 4.737ns (4.220ns logic,
0.517ns route)
(89.1% logic,
10.9% route)

twoalus:
Destination: sum[31]
Requirement: 6.000ns
Data Path Delay: 6.773ns (Levels of Logic = 18)
Negative Clock Skew: 0.000ns
Source Clock: clk_c rising at 0.000ns
Destination Clock: clk_c rising at 6.000ns
Timing Improvement Wizard
Data Path: a[2] to sum[31]
Location Delay type Delay(ns) Physical Resource
Logical
Resource(s)
-------------------------------------------------
-------------------
SLICE_X11Y0.YQ Tcko 0.493 a[3]
a[2]
SLICE_X8Y1.F1 net (fanout=2) 0.749 a[2]
SLICE_X8Y1.COUT Topcyf 0.668 sum_1_0[2]
sum_1_0_axb_2
sum_1_0_cry_2
sum_1_0_cry_3
SLICE_X8Y2.CIN net (fanout=1) 0.000 sum_1_0_cry_3/O
SLICE_X8Y2.COUT Tbyp 0.092 sum_1_0[4]
SLICE_X8Y3.CIN net (fanout=1) 0.000 sum_1_0_cry_5/O
SLICE_X8Y3.COUT Tbyp 0.092 sum_1_0[6]
SLICE_X8Y4.CIN net (fanout=1) 0.000 sum_1_0_cry_7/O
SLICE_X8Y4.COUT Tbyp 0.092 sum_1_0[8]
SLICE_X8Y5.CIN net (fanout=1) 0.000 sum_1_0_cry_9/O
SLICE_X8Y5.COUT Tbyp 0.092 sum_1_0[10]
SLICE_X8Y6.CIN net (fanout=1) 0.000 sum_1_0_cry_11/O
SLICE_X8Y6.COUT Tbyp 0.092 sum_1_0[12]
SLICE_X8Y7.CIN net (fanout=1) 0.000 sum_1_0_cry_13/O
SLICE_X8Y7.COUT Tbyp 0.092 sum_1_0[14]
SLICE_X8Y8.CIN net (fanout=1) 0.000 sum_1_0_cry_15/O
SLICE_X8Y8.COUT Tbyp 0.092 sum_1_0[16]
SLICE_X8Y9.CIN net (fanout=1) 0.000 sum_1_0_cry_17/O
SLICE_X8Y9.COUT Tbyp 0.092 sum_1_0[18]
SLICE_X8Y10.CIN net (fanout=1) 0.000 sum_1_0_cry_19/O
SLICE_X8Y10.COUT Tbyp 0.092 sum_1_0[20]
SLICE_X8Y11.CIN net (fanout=1) 0.000 sum_1_0_cry_21/O
SLICE_X8Y11.COUT Tbyp 0.092 sum_1_0[22]
SLICE_X8Y12.CIN net (fanout=1) 0.000 sum_1_0_cry_23/O
SLICE_X8Y12.COUT Tbyp 0.092 sum_1_0[24]
SLICE_X8Y13.CIN net (fanout=1) 0.000 sum_1_0_cry_25/O
SLICE_X8Y13.COUT Tbyp 0.092 sum_1_0[26]
SLICE_X8Y14.CIN net (fanout=1) 0.000 sum_1_0_cry_27/O
SLICE_X8Y14.Y Tciny 1.257 sum_1_0[28]
sum_1_0_cry_28
sum_1_0_s_29
SLICE_X9Y14.G2 net (fanout=1) 0.403 sum_1_0[29]
SLICE_X9Y14.COUT Topcyg 0.519 sum[28]
sum_1_1_axb_29
sum_1_1_cry_29
SLICE_X9Y15.CIN net (fanout=1) 0.000 sum_1_1_cry_29/O
SLICE_X9Y15.Y Tciny 1.257 sum[30]
sum_1_1_cry_30
sum_1_1_s_31
SLICE_X9Y15.DY net (fanout=1) 0.001 sum_1[31]
SLICE_X9Y15.CLK Tdyck 0.322 sum[30]
sum[31]
-------------------------------------------------
---------------------------
Total 6.773ns (5.620ns logic,
1.153ns route)
(83.0% logic,
17.0% route) 4.7 ns for one ALU in one pipeline stage
6.8 ns for two ALUs in one pipeline stage

Hmm, that's worse than I had expected. Looking above, the expensive bit
is the extra Topcyg and especially Tciny of the second adder. All we
are saving with the concurrency is the 15*Tbyp of about 1.4 ns.

Jan Gray, Gray Research LLC




(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

RE: Re: VLIW processors anyone ? - Campbell, John - Mar 12 13:36:00 2002

Interesting.
But why do both onealu and twoalus have 18 levels of
logic? > Data Path Delay: 4.737ns (Levels of Logic = 18)

> Data Path Delay: 6.773ns (Levels of Logic = 18)

I would have expected (probably naively) them to be 32
and 64, respectively. So what gives?
-jc





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

news - Alessandro Noriaki Ide - Mar 14 8:25:00 2002


Hi people, Does anybody know any site, continuously up to date,containing news about
FPGAs and reconfigurable computing ? thanks

=========================================

Alessandro Noriaki Ide
Mestrado em Ciencia da Computacao
Universidade Federal de Sao Carlos
http://www.dc.ufscar.br/~noriaki
e-mail:
uin: 46269985

==========================================





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: VLIW processors anyone ? - Reinoud - Mar 17 19:43:00 2002


Jan,

Excellent writeup; here's too late and too short a reply...

> The challenging part of a VLIW project is the compiler.

...and making compiler and hardware work together to keep hardware
bottlenecks like bypass (i.e. forwarding) network and register
file port scaling in check (for >4 ILP).

> To summarize my opinions, is a 2-issue machine is straightforward and
> even worthwhile, whereas (depending upon computation kernel) a 4+ issue
> machine may well bog down in its multiplexers and therefore may not make
> the best use of FPGA resources.

There are several ways around this. One obvious approach is
clustering (a cluster is basically a small VLIW section with full
bypassing and register ports, that is only loosely coupled with
other sections). Another less obvious approach is to explicitly
program the bypass busses (which allows for a huge reduction in
bypass _and_ register file port cost!). Hint, see:

http://ce-serv.et.tudelft.nl/MOVE/

(Disclaimer: that's where I work, although on an interesting and
mostly unpublished new arch variety.)

> REGISTER FILE DESIGN
>
> ...
>
> One $64 design question, properly answerable only with a good compiler
> at your side, is how to then arrange the read ports. At one extreme, if
> each of the four banks is entirely isolated from the other, (like a
> left-brain/right-brain patient with their corpus collosum cut), then you
> only need 2 read ports on each bank.

This probably won't work out well. Such isolation _does_ work quite
well for clusters within a wide VLIW, but within the cluster you'll
need more communication (we have an upcoming paper about this;).

> Result forwarding (register file bypass): if an operation in slot i,
> cycle 1, is permitted with full generality, to consume the result of an
> operation from slot j, cycle 0, then you need n copies of an n:1 result
> forwarding mux. Again, this is very expensive, so you will be sorely
> tempted to reduce the generality of result forwarding, or eliminate it
> entirely. Again, this is a design tradeoff your compiler must help you
> select.

Generality makes scheduling easier, but not much, and is certainly
not needed. The effects of clustering are fairly well documented,
and we've had excellent results with scheduling for more general
limited interconnect VLIWs.

> Instead, you will want to use some form of predicated execution. Some
> computations will set predicate bits that represent the outcomes of
> conditional tests. Then other instructions' execution (result
> write-back and other side effects) will be predicated upon these
> specified predicate bits.

This won't work well with your proposed single-write-port register
files. When selecting between two potential producers of some value,
with a single write port, you'll have to serialize the predicated
register writes... Which tends to hurt on a VLIW. Predicates on a
VLIW make most sense with multiple write ports.

- Reinoud

PS. Did you ever consider writing a book on FPGA-CPU's?






(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

RE: VLIW processors anyone ? - Jan Gray - Mar 18 10:46:00 2002

> Hint, see:
> http://ce-serv.et.tudelft.nl/MOVE/

Ah, move machines. I will take a look, thank you.

> At one extreme,
> > if each of the four banks is entirely isolated from the
> other, (like a
> > left-brain/right-brain patient with their corpus collosum
> cut), then
> > you only need 2 read ports on each bank.
>
> This probably won't work out well. Such isolation _does_
> work quite well for clusters within a wide VLIW, but within
> the cluster you'll need more communication (we have an
> upcoming paper about this;).

I agree, I also doubt it will work well, but I was just trying to sketch
the extremes of possibility.

> > Instead, you will want to use some form of predicated
> execution. Some
> > computations will set predicate bits that represent the outcomes of
> > conditional tests. Then other instructions' execution (result
> > write-back and other side effects) will be predicated upon these
> > specified predicate bits.
>
> This won't work well with your proposed single-write-port
> register files. When selecting between two potential
> producers of some value, with a single write port, you'll
> have to serialize the predicated register writes... Which
> tends to hurt on a VLIW. Predicates on a VLIW make most
> sense with multiple write ports.

I advocate 1- or 2-write *data* register files for LUT RAM
implementations. This is not a constraint for predicate register files.
N-read m-write predicate register files can easily be made from
assemblies of flip-flops, logic to drive CEs, and logic for output
multiplexers.

By the way, on my trip down to SF for ESC, I took along "Readings in
Computer Architecture", ed. by Mark Hill et al. It's a wonderful book,
most highly recommended, a compilation of 5 dozen important modern
computer architecture papers, well organized and set in context.
Apropos to this discussion is the paper "A Comparison of Full and
Partial Predicated Execution Support for ILP Processors" by Mahlke et
al, which compares fully predication architectures to traditional
non-predicated superscalars retrofitted with CMOVE (if (pred) rdest =
rsrc) and SELECT (rdest = pred ? rsrc1 : rsrc2). This paper provides a
nice introduction to ILP compiler support (superblocks and if-conversion
etc.) and how to use that to take e.g. a 14 cycle grep inner loop down
to 6 cycles in a 4-issue 1-branch VLIW.

By the way^2, this is the kind of compilation IA64 must do well to
compete with out-of-order superscalars. I suspect one of IA64's
greatest challenges will be the social engineering required to make
software developers change from
compile -optimized *.c
to
compile -profiling *.c
profile app against representative workloads
compile -profile-feedback-optimizing *.c
The latter is more difficult to get right and to institutionalize in
makefiles or what-have-you.

> PS. Did you ever consider writing a book on FPGA-CPU's?

Thank you very much for the compliment. '96-'00 I wrote a few book
outlines, and discussed the project with a leading publisher, but never
pushed the button.

"Writing is an adventure. To begin with, it is a toy and an amusement.
Then it becomes a mistress, then it becomes a master, then it becomes a
tyrant. The last phase is that just as you are about to be reconciled to
your servitude, you kill the monster and fling him to the public." --
Churchill.

The Circuit Cellar articles were an attempt to get the book bug out of
my system. Didn't work.

Jan Gray, Gray Research LLC





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

Re: VLIW processors anyone ? - Reinoud - Apr 1 17:55:00 2002


Another rather late reply (just back from vacation :-).

"Jan Gray" <> wrote:
> > > Instead, you will want to use some form of predicated execution. Some
> > > computations will set predicate bits that represent the outcomes of
> > > conditional tests. Then other instructions' execution (result
> > > write-back and other side effects) will be predicated upon these
> > > specified predicate bits.
> >
> > This won't work well with your proposed single-write-port
> > register files. When selecting between two potential
> > producers of some value, with a single write port, you'll
> > have to serialize the predicated register writes... Which
> > tends to hurt on a VLIW. Predicates on a VLIW make most
> > sense with multiple write ports.
>
> I advocate 1- or 2-write *data* register files for LUT RAM
> implementations. This is not a constraint for predicate register files.

But I _was_ talking about data register files, not predicate register
files. Using predicates in combination with data register files with a
single write port will often force serialization of execution. For
example, try if-converting x=(c)?a:b;. With a single write port you
are forced to place two assignments to x in separate instruction
words. Select operations may be more effective than predicates for
such an architecture.

> "Writing is an adventure. To begin with, it is a toy and an amusement.
> Then it becomes a mistress, then it becomes a master, then it becomes a
> tyrant. The last phase is that just as you are about to be reconciled to
> your servitude, you kill the monster and fling him to the public." --
> Churchill.

Yeah, that's why I asked if you would consider writing one. I sure
wouldn't :-).

- Reinoud






(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )

RE: VLIW processors anyone ? - Jan Gray - Apr 1 18:31:00 2002

> But I _was_ talking about data register files, not predicate
> register files. Using predicates in combination with data
> register files with a single write port will often force
> serialization of execution. For example, try if-converting
> x=(c)?a:b;. With a single write port you are forced to place
> two assignments to x in separate instruction words. Select
> operations may be more effective than predicates for such an
> architecture.

Ah, I see what you mean (I think):
if (c)
x = a;
else
x = b;
if-converts to (predicated)
x = a when c
x = b when !c

In a VLIW with a general register file, you can schedule these two issue
slots in the same wide instruction. In a VLIW with a partitioned
register file as we've been discussing, we would have to schedule it in
two successive instructions.

Now we could certainly extend the ALU to do SELECT (x = pred ? a : b)
and still get by with 2-read 1-write register file slices.

Why not both? All issue slots predicated + add SELECT to the ALU
operation repetoire...?

Jan Gray, Gray Research LLC




(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )