This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).
|
I want to do a VLIW processor in an FPGA, but I'm not confident what I'm doing. Does anyone have links to sample (educational) VLIW processor designs ? I've searched the net but can't seem to find anything. Current design: http://www.birdcomputer.ca/blackbird.htm Thanks Rob |
|
|
|
rtfinch35 wrote: > > I want to do a VLIW processor in an FPGA, but I'm not confident what > I'm doing. Does anyone have links to sample (educational) VLIW > processor designs ? I've searched the net but can't seem to find > anything. > > Current design: http://www.birdcomputer.ca/blackbird.htm Why not a micro-code designed cpu? Since internal memory is faster than external memory why not go back to CISC's style architecture rather than say a RISC emulator. The instruction set could be raw micro-code (like risc) or one of 3 user configured tables. A fourth table is used for MMU/OS stuff. Here would be a nice CPU to emulate. http://www.spies.com/~aek/alto/index.html -- Ben Franchuk - Dawn * 12/24 bit cpu * www.jetnet.ab.ca/users/bfranchuk/index.html |
|
> I want to do a VLIW processor in an FPGA, but I'm not confident what > I'm doing. Does anyone have links to sample (educational) VLIW > processor designs ? I've searched the net but can't seem to find > anything. INTRODUCTION TO VLIW VLIW ::= very long instruction word: a machine with one instruction pointer (how's that, Reinoud? :-) ) that selects a long instruction word that issues a plurality of operations to a plurality of function units each cycle. In a VLIW system, the compiler schedules the multiple operations into instructions; in contrast, in a superscalar RISC, the instruction issue hardware schedules one or more scalar instructions to issue at the same time. You may also see the term LIW vs. VLIW. I think of an LIW as a 2 or 3 issue machine, a VLIW as a 3+ issue machine. In some VLIWs, to improve code density, a variable number of operations can be issued each cycle. There's a rich literature on VLIWs (that I don't purport to have read), but be sure to see: John R. Ellis, Bulldog: A Compiler for VLIW Architectures. The IA64 EPIC "explicitly parallel instruction computing" architecture is the most notable VLIW derivative. Several DSPs like the TI C6x are LIWs. A famous early VLIW was the Multiflow Trace family, a supercomputer startup that lived approximately 1984-1990. The challenging part of a VLIW project is the compiler. Unless you're a glutton for tricky assembly programming, or have a very small and specific problem, it's hardly worth designing the hardware if you don't have a way to compile code to use the parallelism presented by the hardware. Indeed, some LIW design suites (IIRC the Philips LIFE) allow you to pour C code through a compiler and simulator and configure the number and kind of function units to best trade off performance and area against the specific computation kernel you care about. Anyway, on to FPGA VLIW CPU issues. Here are some quick comments. To summarize my opinions, is a 2-issue machine is straightforward and even worthwhile, whereas (depending upon computation kernel) a 4+ issue machine may well bog down in its multiplexers and therefore may not make the best use of FPGA resources. INSTRUCTION FETCH AND ISSUE No sweat. Store instructions in a block RAM based instruction cache. Each BRAM in Virtex derived architectures can read 2x16=32 bits each cycle; each BRAM in Virtex-II can read 2x36=72 bits each cycle. Use 2-4 BRAMs and you can easily fetch 100-200 bits of instruction word each cycle. Keeping the I-cache fed is another matter, left as an exercise. REGISTER FILE DESIGN The key design problem is the register file organization. If you are going to issue n operations each cycle, you must fetch 1-2*n operands and retire up to n results each cycle. That implies a 1-2*n-read n-write per cycle register file. As a rule of thumb, in a modern LUT based FPGA, if a pipelined ALU operation requires about T ns, then you can read or write to a single port LUT RAM (or read AND write to a dual port LUT RAM) in about T/2 ns. (In Virtex-II, T is about 5 ns). Let's discuss an n=4 operation machine. It is a challenge to retire 4 results to a LUT RAM based RF in T ns. Sure, if the cycle time balloons to 2T, you can sequentially retire all four results in 4*T/2 ns, and read operands in parallel using dual port LUT RAM. Thus in a fully general organization, where any of the four results can retire to any four registers, "it can't be done" in T ns as defined above. Instead, consider a partitioned register file design. For instance, imagine a 64-register machine partitioned into 4 banks of 16 registers each. Each cycle, one result can be retired into each bank. That is easily targeted to a LUT RAM implementation. Indeed, you can fix the machine such that each issue slot is associated with one write port on the register file. Say bank m is assigned registers i, i div 4 == m, with 0<=i<64, 0<=m<4. We can easily issue four independent instructions such as r0=r1+r2; r16=r17+r18; r32=r33+r34; r48=r49+r50 One $64 design question, properly answerable only with a good compiler at your side, is how to then arrange the read ports. At one extreme, if each of the four banks is entirely isolated from the other, (like a left-brain/right-brain patient with their corpus collosum cut), then you only need 2 read ports on each bank. In this organization, if you need a register or two from another bank, you would typically burn an issue slot in that sourcing bank to read the registers, and maybe burn another slot in the destination bank to save the registers. (Alternately in the destination bank you can directly mux it into the dependent operation register). At the other extreme, if any of the four banks can with full generality read operands from any of the other four banks, e.g. r0=r1+r2; r16=r3+r4; r32=r5+r6; r48=r7+r8 // cycle 1 r0=r17+r18; r16=r19+r20; r32=r21+r22; r48=r23+r24; // cycle 2 r0=r33+r34; r16=r35+r36; r32=r37+r38; r48=r39+r40; // cycle 3 r0=r49+r50; r16=r51+r52; r32=r53+r54; r48=r55+r56 // cycle 4 Then each of the four banks would need to provide 8 read ports, and each of the 8 operand multiplexers (in front of the four ALUs) would need at least a 4-1 mux. You can of course provide all these ports with replicating or multi-cycling the register file LUT RAMs, but it won't be pretty. In practice, I believe that the expected degree of cross-bank register reading is limited. Maybe you typically only need 3 read ports per bank, perhaps 1 1/2 for that bank's ALU and perhaps 1 1/2 for other slots' accesses. But you have to have a scheduling compiler to help you make these tradeoffs. By the way, in my 1994 brain dump (http://fpgacpu.org/usenet/homebrew.html), with the sketch of the NVLIW1 2-issue LIW, I used exactly this partitioned register file technique, combining two banks of 3r-1w register files. For each issue slot, one operand was required to be a register in that bank; but the other operand could be a register in either bank. Another option is to combine multibanked registers with some heavily multiported registers. For instance, assume each issue slot can read or write 16 bank registers and 16 shared registers, subject across issue slots to some maximum number of shared register file read and write port accesses each cycle. The designers of the Sun MAJC used a similar approach. EXECUTION UNIT Assume a design with 4 ALUs, one branch unit, one memory port. Loads/stores: Perhaps you limit generality and only allow loads/stores to issue from a given slot (and then shuttle those results to other slot register banks using the above multiported reg file description). The general alternative, allowing any slot to issue loads/stores, requires you to mux any ALU output to be the ea, and a per-bank mux to allow the MDR input to be the result. Result forwarding (register file bypass): if an operation in slot i, cycle 1, is permitted with full generality, to consume the result of an operation from slot j, cycle 0, then you need n copies of an n:1 result forwarding mux. Again, this is very expensive, so you will be sorely tempted to reduce the generality of result forwarding, or eliminate it entirely. Again, this is a design tradeoff your compiler must help you select. CONTROL FLOW You want to reduce the number of branches. In average C code you can expect a branch every 5 (rule of thumb) generic RISC instructions or so. In a 4 issue machine you will spend your life branching. Instead, you will want to use some form of predicated execution. Some computations will set predicate bits that represent the outcomes of conditional tests. Then other instructions' execution (result write-back and other side effects) will be predicated upon these specified predicate bits. In this organization, it seems reasonable to allow any issue slot to establish predicates that can be used by any other issue slot; but for simplicity you will only need and want one, or perhaps two, of the issue slots to be able to issue (predicated) branch and jump instructions. You will want a compiler to help you design the specific details of the predication system. IN SUMMARY The sky's the limit. It is easy to build an 8- or 16- or 32-issue VLIW in an FPGA. That's just stamping out more instances of the i-cache + execution unit core slice. Whether the resulting machine runs much faster than a much simpler 2-issue LIW is critically dependent upon your compiler technology. ASIDE: WM As an aside, I'll just briefly mention Wulf's WM machine (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with many interesting features, but of particular interest here, each execute instruction is of the form rd = (rs1 op rs2) op rs3 , an explicitly parallel representation that affords some additional parallelism over the canonical scalar RISC, but which does not increase the number of write ports required on the register file. Jan Gray Gray Research LLC |
|
|
|
> Current design: http://www.birdcomputer.ca/blackbird.htm I see this design reflects many of the issues in my last message. I should read first, write second! Jan Gray Gray Research LLC |
|
> > The challenging part of a VLIW project is the compiler. Unless you're a > glutton for tricky assembly programming, or have a very small and > specific problem, it's hardly worth designing the hardware if you don't I'm planning on using the Transmeta approach. I can write a simple virtual instruction set to real instruction set translater. At first I'm aiming for a direct 1:1 translation that won't provide any speedup (in fact it'll slow things down because of the wider instruction word). As the translater gets better, performance will improve. (Perhaps I could use the xr16 ISA as the virtual instruction set? :) > Anyway, on to FPGA VLIW CPU issues. Here are some quick comments. > > To summarize my opinions, is a 2-issue machine is straightforward and > even worthwhile, whereas (depending upon computation kernel) a 4+ issue > machine may well bog down in its multiplexers and therefore may not make > the best use of FPGA resources. I quickly put together a datapath and it looks like the datapath takes about 1/3 of a SpartanII 2S200 and can run close to 50MHz, of course this is without all the other extras needed to make it work, and assuming I got it right (a pretty big assumption). The thing that appears to slow it down is actually the barrel shifter which I may get rid of by limiting shifts to 0 to 3 bits at once. I think I can get a peak instruction completion rate of five instructions per cycle (I admit a little contrived) but it makes for great marketing. (eg. 1) a compare (target = condition codes), 2) a store (no target reg), 3) an effective address calc (target = special addr reg) 4) a branch, 5) some alu operation 6) maybe ? Completing five instructions per cycle at 50MHz would give me a 250 MIPs processor... Well I can dream can't I ? > INSTRUCTION FETCH AND ISSUE > > No sweat. Store instructions in a block RAM based instruction > Planning on using a 96 bit (or possibly 128 bit) wide i-cache, with a 32 bit memory interface. > and retire up to n results each cycle. That implies a 1-2*n-read > n-write per cycle register file. Um, I think you need about half as many write ports because only about 50% of instructions actually update registers. > CONTROL FLOW > > Instead, you will want to use some form of predicated execution. I was thinking of using branch prediction. > ASIDE: WM > > As an aside, I'll just briefly mention Wulf's WM machine > (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with I tried to access some of the docs but the links seem to be broken :( > instruction is of the form > rd = (rs1 op rs2) op rs3 , If the operations are stacked like this, wouldn't it increase the cycle time of the processor ? Rob |
|
|
|
> I'm planning on using the Transmeta approach. I can write a simple > virtual instruction set to real instruction set translater. At first > I'm aiming for a direct 1:1 translation that won't provide any > speedup (in fact it'll slow things down because of the wider > instruction word). As the translater gets better, performance will > improve. (Perhaps I could use the xr16 ISA as the virtual instruction > set? :) Be our guest... > I quickly put together a datapath and it looks like the datapath > takes about 1/3 of a SpartanII 2S200 and can run close to 50MHz, of > course this is without all the other extras needed to make it work, > and assuming I got it right (a pretty big assumption). The thing that > appears to slow it down is actually the barrel shifter which I may > get rid of by limiting shifts to 0 to 3 bits at once. I think I can > get a peak instruction completion rate of five instructions per cycle > (I admit a little contrived) but it makes for great marketing. (eg. > 1) a compare (target = condition codes), 2) a store (no target reg), > 3) an effective address calc (target = special addr reg) 4) a branch, > 5) some alu operation 6) maybe ? Completing five instructions per > cycle at 50MHz would give me a 250 MIPs processor... Well I can dream > can't I ? What matters: * wall clock time * total energy consumed by the computation * FPGA area * dev tools ease of use. What doesn't matter: * aggregate MIPS :-) See e.g. http://fpgacpu.org/usenet/array.html for a (useless) 10^13 ops/s machine. > > and retire up to n results each cycle. That implies a 1-2*n-read > > n-write per cycle register file. > Um, I think you need about half as many write ports because only > about 50% of instructions actually update registers. Certainly but it depends upon whether and how you handle the worst case cases, whether you have multiplexers between ALUs and write port(s), etc. > > CONTROL FLOW > > > > Instead, you will want to use some form of predicated execution. > > I was thinking of using branch prediction. That will help. So you only fetch useless instructions on a mispredict? > > ASIDE: WM > > > > As an aside, I'll just briefly mention Wulf's WM machine > > (http://www.cs.virginia.edu/~wm/wm.html), a simple architecture with > > I tried to access some of the docs but the links seem to be broken :( Hmm. I have the ACM SIGARCH paper on it around here somewhere. The two big architectural ideas are the two ops per instruction and the implicit streaming memory ports exposed as special purpose registers. > > instruction is of the form > > rd = (rs1 op rs2) op rs3 , > > If the operations are stacked like this, wouldn't it increase the > cycle time of the processor ? No and no. First, IIRC, WM does the second op in a second pipeline stage. Second, as I first learned during the Sun SuperSPARC presentation at Hot Chips III (1991) (I think it was), the delay through two ALUs (a+b)+c is not necessarily the same as two times the delay through one ALU (a+b), since there is concurrency -- lower bits of (a+b) can be added to lower bits of c even as upper bits of (a+b) are still ripple-carry-settling. <Pause while I go and run some tests.> OK, here's are two quick and dirty test circuits, sans floorplanning: module onealu(clk, i, o); input clk, i; output o; reg o; reg [31:0] a, b, sum; reg z; always @(posedge clk) begin {a,b} <= {a,b,i}; sum <= a+b; z <= sum == 0; o <= z; end endmodule module twoalus(clk, i, o); input clk, i; output o; reg o; reg [31:0] a, b, c, sum; reg z; always @(posedge clk) begin {a,b,c} <= {a,b,c,i}; sum <= (a+b)+c; z <= sum == 0; o <= z; end endmodule 'onealu' approximates the delay from pipeline registers, through an adder, and back into pipeline registers. 'twoalu' approximates the same thing, but through two adders. Here are the TRCE numbers for Virtex-II -5: onealu: Slack: -0.237ns (requirement - (data path - negative clock skew)) Source: b[0] Destination: sum[31] Requirement: 4.500ns Data Path Delay: 4.737ns (Levels of Logic = 18) Negative Clock Skew: 0.000ns Source Clock: clk_c rising at 0.000ns Destination Clock: clk_c rising at 4.500ns Timing Improvement Wizard Data Path: b[0] to sum[31] Location Delay type Delay(ns) Physical Resource Logical Resource(s) ------------------------------------------------- ------------------- L3.IQ1 Tiockiq 0.598 i b[0] SLICE_X0Y0.F1 net (fanout=2) 0.516 b[0] SLICE_X0Y0.COUT Topcyf 0.755 sum[0] sum_1_axb_0 sum_1_cry_0 sum_1_cry_1 SLICE_X0Y1.CIN net (fanout=1) 0.000 sum_1_cry_1/O SLICE_X0Y1.COUT Tbyp 0.092 sum[2] SLICE_X0Y2.CIN net (fanout=1) 0.000 sum_1_cry_3/O SLICE_X0Y2.COUT Tbyp 0.092 sum[4] SLICE_X0Y3.CIN net (fanout=1) 0.000 sum_1_cry_5/O SLICE_X0Y3.COUT Tbyp 0.092 sum[6] SLICE_X0Y4.CIN net (fanout=1) 0.000 sum_1_cry_7/O SLICE_X0Y4.COUT Tbyp 0.092 sum[8] SLICE_X0Y5.CIN net (fanout=1) 0.000 sum_1_cry_9/O SLICE_X0Y5.COUT Tbyp 0.092 sum[10] SLICE_X0Y6.CIN net (fanout=1) 0.000 sum_1_cry_11/O SLICE_X0Y6.COUT Tbyp 0.092 sum[12] SLICE_X0Y7.CIN net (fanout=1) 0.000 sum_1_cry_13/O SLICE_X0Y7.COUT Tbyp 0.092 sum[14] SLICE_X0Y8.CIN net (fanout=1) 0.000 sum_1_cry_15/O SLICE_X0Y8.COUT Tbyp 0.092 sum[16] SLICE_X0Y9.CIN net (fanout=1) 0.000 sum_1_cry_17/O SLICE_X0Y9.COUT Tbyp 0.092 sum[18] SLICE_X0Y10.CIN net (fanout=1) 0.000 sum_1_cry_19/O SLICE_X0Y10.COUT Tbyp 0.092 sum[20] SLICE_X0Y11.CIN net (fanout=1) 0.000 sum_1_cry_21/O SLICE_X0Y11.COUT Tbyp 0.092 sum[22] SLICE_X0Y12.CIN net (fanout=1) 0.000 sum_1_cry_23/O SLICE_X0Y12.COUT Tbyp 0.092 sum[24] SLICE_X0Y13.CIN net (fanout=1) 0.000 sum_1_cry_25/O SLICE_X0Y13.COUT Tbyp 0.092 sum[26] SLICE_X0Y14.CIN net (fanout=1) 0.000 sum_1_cry_27/O SLICE_X0Y14.COUT Tbyp 0.092 sum[28] SLICE_X0Y15.CIN net (fanout=1) 0.000 sum_1_cry_29/O SLICE_X0Y15.Y Tciny 1.257 sum[30] sum_1_cry_30 sum_1_s_31 SLICE_X0Y15.DY net (fanout=1) 0.001 sum_1[31] SLICE_X0Y15.CLK Tdyck 0.322 sum[30] sum[31] ------------------------------------------------- --------------------------- Total 4.737ns (4.220ns logic, 0.517ns route) (89.1% logic, 10.9% route) twoalus: Destination: sum[31] Requirement: 6.000ns Data Path Delay: 6.773ns (Levels of Logic = 18) Negative Clock Skew: 0.000ns Source Clock: clk_c rising at 0.000ns Destination Clock: clk_c rising at 6.000ns Timing Improvement Wizard Data Path: a[2] to sum[31] Location Delay type Delay(ns) Physical Resource Logical Resource(s) ------------------------------------------------- ------------------- SLICE_X11Y0.YQ Tcko 0.493 a[3] a[2] SLICE_X8Y1.F1 net (fanout=2) 0.749 a[2] SLICE_X8Y1.COUT Topcyf 0.668 sum_1_0[2] sum_1_0_axb_2 sum_1_0_cry_2 sum_1_0_cry_3 SLICE_X8Y2.CIN net (fanout=1) 0.000 sum_1_0_cry_3/O SLICE_X8Y2.COUT Tbyp 0.092 sum_1_0[4] SLICE_X8Y3.CIN net (fanout=1) 0.000 sum_1_0_cry_5/O SLICE_X8Y3.COUT Tbyp 0.092 sum_1_0[6] SLICE_X8Y4.CIN net (fanout=1) 0.000 sum_1_0_cry_7/O SLICE_X8Y4.COUT Tbyp 0.092 sum_1_0[8] SLICE_X8Y5.CIN net (fanout=1) 0.000 sum_1_0_cry_9/O SLICE_X8Y5.COUT Tbyp 0.092 sum_1_0[10] SLICE_X8Y6.CIN net (fanout=1) 0.000 sum_1_0_cry_11/O SLICE_X8Y6.COUT Tbyp 0.092 sum_1_0[12] SLICE_X8Y7.CIN net (fanout=1) 0.000 sum_1_0_cry_13/O SLICE_X8Y7.COUT Tbyp 0.092 sum_1_0[14] SLICE_X8Y8.CIN net (fanout=1) 0.000 sum_1_0_cry_15/O SLICE_X8Y8.COUT Tbyp 0.092 sum_1_0[16] SLICE_X8Y9.CIN net (fanout=1) 0.000 sum_1_0_cry_17/O SLICE_X8Y9.COUT Tbyp 0.092 sum_1_0[18] SLICE_X8Y10.CIN net (fanout=1) 0.000 sum_1_0_cry_19/O SLICE_X8Y10.COUT Tbyp 0.092 sum_1_0[20] SLICE_X8Y11.CIN net (fanout=1) 0.000 sum_1_0_cry_21/O SLICE_X8Y11.COUT Tbyp 0.092 sum_1_0[22] SLICE_X8Y12.CIN net (fanout=1) 0.000 sum_1_0_cry_23/O SLICE_X8Y12.COUT Tbyp 0.092 sum_1_0[24] SLICE_X8Y13.CIN net (fanout=1) 0.000 sum_1_0_cry_25/O SLICE_X8Y13.COUT Tbyp 0.092 sum_1_0[26] SLICE_X8Y14.CIN net (fanout=1) 0.000 sum_1_0_cry_27/O SLICE_X8Y14.Y Tciny 1.257 sum_1_0[28] sum_1_0_cry_28 sum_1_0_s_29 SLICE_X9Y14.G2 net (fanout=1) 0.403 sum_1_0[29] SLICE_X9Y14.COUT Topcyg 0.519 sum[28] sum_1_1_axb_29 sum_1_1_cry_29 SLICE_X9Y15.CIN net (fanout=1) 0.000 sum_1_1_cry_29/O SLICE_X9Y15.Y Tciny 1.257 sum[30] sum_1_1_cry_30 sum_1_1_s_31 SLICE_X9Y15.DY net (fanout=1) 0.001 sum_1[31] SLICE_X9Y15.CLK Tdyck 0.322 sum[30] sum[31] ------------------------------------------------- --------------------------- Total 6.773ns (5.620ns logic, 1.153ns route) (83.0% logic, 17.0% route) 4.7 ns for one ALU in one pipeline stage 6.8 ns for two ALUs in one pipeline stage Hmm, that's worse than I had expected. Looking above, the expensive bit is the extra Topcyg and especially Tciny of the second adder. All we are saving with the concurrency is the 15*Tbyp of about 1.4 ns. Jan Gray, Gray Research LLC |
|
Interesting. But why do both onealu and twoalus have 18 levels of logic? > Data Path Delay: 4.737ns (Levels of Logic = 18) > Data Path Delay: 6.773ns (Levels of Logic = 18) I would have expected (probably naively) them to be 32 and 64, respectively. So what gives? -jc |
|
|
|
Hi people, Does anybody know any site, continuously up to date,containing news about FPGAs and reconfigurable computing ? thanks ========================================= Alessandro Noriaki Ide Mestrado em Ciencia da Computacao Universidade Federal de Sao Carlos http://www.dc.ufscar.br/~noriaki e-mail: uin: 46269985 ========================================== |
|
Jan, Excellent writeup; here's too late and too short a reply... > The challenging part of a VLIW project is the compiler. ...and making compiler and hardware work together to keep hardware bottlenecks like bypass (i.e. forwarding) network and register file port scaling in check (for >4 ILP). > To summarize my opinions, is a 2-issue machine is straightforward and > even worthwhile, whereas (depending upon computation kernel) a 4+ issue > machine may well bog down in its multiplexers and therefore may not make > the best use of FPGA resources. There are several ways around this. One obvious approach is clustering (a cluster is basically a small VLIW section with full bypassing and register ports, that is only loosely coupled with other sections). Another less obvious approach is to explicitly program the bypass busses (which allows for a huge reduction in bypass _and_ register file port cost!). Hint, see: http://ce-serv.et.tudelft.nl/MOVE/ (Disclaimer: that's where I work, although on an interesting and mostly unpublished new arch variety.) > REGISTER FILE DESIGN > > ... > > One $64 design question, properly answerable only with a good compiler > at your side, is how to then arrange the read ports. At one extreme, if > each of the four banks is entirely isolated from the other, (like a > left-brain/right-brain patient with their corpus collosum cut), then you > only need 2 read ports on each bank. This probably won't work out well. Such isolation _does_ work quite well for clusters within a wide VLIW, but within the cluster you'll need more communication (we have an upcoming paper about this;). > Result forwarding (register file bypass): if an operation in slot i, > cycle 1, is permitted with full generality, to consume the result of an > operation from slot j, cycle 0, then you need n copies of an n:1 result > forwarding mux. Again, this is very expensive, so you will be sorely > tempted to reduce the generality of result forwarding, or eliminate it > entirely. Again, this is a design tradeoff your compiler must help you > select. Generality makes scheduling easier, but not much, and is certainly not needed. The effects of clustering are fairly well documented, and we've had excellent results with scheduling for more general limited interconnect VLIWs. > Instead, you will want to use some form of predicated execution. Some > computations will set predicate bits that represent the outcomes of > conditional tests. Then other instructions' execution (result > write-back and other side effects) will be predicated upon these > specified predicate bits. This won't work well with your proposed single-write-port register files. When selecting between two potential producers of some value, with a single write port, you'll have to serialize the predicated register writes... Which tends to hurt on a VLIW. Predicates on a VLIW make most sense with multiple write ports. - Reinoud PS. Did you ever consider writing a book on FPGA-CPU's? |
|
|
|
> Hint, see: > http://ce-serv.et.tudelft.nl/MOVE/ Ah, move machines. I will take a look, thank you. > At one extreme, > > if each of the four banks is entirely isolated from the > other, (like a > > left-brain/right-brain patient with their corpus collosum > cut), then > > you only need 2 read ports on each bank. > > This probably won't work out well. Such isolation _does_ > work quite well for clusters within a wide VLIW, but within > the cluster you'll need more communication (we have an > upcoming paper about this;). I agree, I also doubt it will work well, but I was just trying to sketch the extremes of possibility. > > Instead, you will want to use some form of predicated > execution. Some > > computations will set predicate bits that represent the outcomes of > > conditional tests. Then other instructions' execution (result > > write-back and other side effects) will be predicated upon these > > specified predicate bits. > > This won't work well with your proposed single-write-port > register files. When selecting between two potential > producers of some value, with a single write port, you'll > have to serialize the predicated register writes... Which > tends to hurt on a VLIW. Predicates on a VLIW make most > sense with multiple write ports. I advocate 1- or 2-write *data* register files for LUT RAM implementations. This is not a constraint for predicate register files. N-read m-write predicate register files can easily be made from assemblies of flip-flops, logic to drive CEs, and logic for output multiplexers. By the way, on my trip down to SF for ESC, I took along "Readings in Computer Architecture", ed. by Mark Hill et al. It's a wonderful book, most highly recommended, a compilation of 5 dozen important modern computer architecture papers, well organized and set in context. Apropos to this discussion is the paper "A Comparison of Full and Partial Predicated Execution Support for ILP Processors" by Mahlke et al, which compares fully predication architectures to traditional non-predicated superscalars retrofitted with CMOVE (if (pred) rdest = rsrc) and SELECT (rdest = pred ? rsrc1 : rsrc2). This paper provides a nice introduction to ILP compiler support (superblocks and if-conversion etc.) and how to use that to take e.g. a 14 cycle grep inner loop down to 6 cycles in a 4-issue 1-branch VLIW. By the way^2, this is the kind of compilation IA64 must do well to compete with out-of-order superscalars. I suspect one of IA64's greatest challenges will be the social engineering required to make software developers change from compile -optimized *.c to compile -profiling *.c profile app against representative workloads compile -profile-feedback-optimizing *.c The latter is more difficult to get right and to institutionalize in makefiles or what-have-you. > PS. Did you ever consider writing a book on FPGA-CPU's? Thank you very much for the compliment. '96-'00 I wrote a few book outlines, and discussed the project with a leading publisher, but never pushed the button. "Writing is an adventure. To begin with, it is a toy and an amusement. Then it becomes a mistress, then it becomes a master, then it becomes a tyrant. The last phase is that just as you are about to be reconciled to your servitude, you kill the monster and fling him to the public." -- Churchill. The Circuit Cellar articles were an attempt to get the book bug out of my system. Didn't work. Jan Gray, Gray Research LLC |
|
|
|
Another rather late reply (just back from vacation :-). "Jan Gray" <> wrote: > > > Instead, you will want to use some form of predicated execution. Some > > > computations will set predicate bits that represent the outcomes of > > > conditional tests. Then other instructions' execution (result > > > write-back and other side effects) will be predicated upon these > > > specified predicate bits. > > > > This won't work well with your proposed single-write-port > > register files. When selecting between two potential > > producers of some value, with a single write port, you'll > > have to serialize the predicated register writes... Which > > tends to hurt on a VLIW. Predicates on a VLIW make most > > sense with multiple write ports. > > I advocate 1- or 2-write *data* register files for LUT RAM > implementations. This is not a constraint for predicate register files. But I _was_ talking about data register files, not predicate register files. Using predicates in combination with data register files with a single write port will often force serialization of execution. For example, try if-converting x=(c)?a:b;. With a single write port you are forced to place two assignments to x in separate instruction words. Select operations may be more effective than predicates for such an architecture. > "Writing is an adventure. To begin with, it is a toy and an amusement. > Then it becomes a mistress, then it becomes a master, then it becomes a > tyrant. The last phase is that just as you are about to be reconciled to > your servitude, you kill the monster and fling him to the public." -- > Churchill. Yeah, that's why I asked if you would consider writing one. I sure wouldn't :-). - Reinoud |
|
|
|
> But I _was_ talking about data register files, not predicate > register files. Using predicates in combination with data > register files with a single write port will often force > serialization of execution. For example, try if-converting > x=(c)?a:b;. With a single write port you are forced to place > two assignments to x in separate instruction words. Select > operations may be more effective than predicates for such an > architecture. Ah, I see what you mean (I think): if (c) x = a; else x = b; if-converts to (predicated) x = a when c x = b when !c In a VLIW with a general register file, you can schedule these two issue slots in the same wide instruction. In a VLIW with a partitioned register file as we've been discussing, we would have to schedule it in two successive instructions. Now we could certainly extend the ALU to do SELECT (x = pred ? a : b) and still get by with 2-read 1-write register file slices. Why not both? All issue slots predicated + add SELECT to the ALU operation repetoire...? Jan Gray, Gray Research LLC |