Jan, I very much agree that efficient large parallel systems have to
provide the right ratio of resources. I also think that control is just
one of the resources to share, and in many cases has to be shared among
multiple function units for best results. From this point of view, using
multiscalar PEs is just a natural extension of your proposal...

> You're right, and I should have addressed multiple issue PEs -- but (I
> think) so am I. Even in a three issue VLIW, if you only need a multiply
> one out of every ten instruction issue slots (3 or 4 cycles), maybe you
> should be sharing your multiply unit with other instances of your PE.

Yes, or maybe you should use VLIW PEs with an issue width of 10 in that
case. For applications with a large amount of parallelism, PEs with an
issue width of around 10 seem to be a sweet spot (beyond that width,
basic block sizes tend to become a problem and cycle times start to
suffer). Also, at that width you can usually avoid the complexities of
sharing resources between PEs (except memory, of course, which presents
a rather hairy problem all by itself). Note that there exist VLIW
architectures that scale nicely to 10+ issue widths by using distributed
register files and limited interconnect/bypassing.

Also, note that if the amount of parallelism available is 'embarrassing'
enough to keep lots of single-issue PEs busy, this parallelism usually
maps well to (significantly cheaper) VLIW or even SIMD architectures.
For example, consider how many apps vectorize well.

> We'll see. :-) To some extent, it comes down to this. Some useful
> resource, done right, is too expensive to assign to each PE. Either we
> do a stripped down and limited subset of resource that is *just*
> affordable per PE, or we leave out the resource, and then *share* an
> instance, done right, amongst 3 or 5 or 10 PEs.

But it seems to me that the latter will nearly always be less efficient
than sharing said resource among several issue slots within one
multi-issue PE. The single, centralized control will not only be simpler
and cheaper, but also more efficient because of the scheduling
opportunities.

> I know from staring at it that a proper data cache, that also does byte
> and halfword alignment, and so forth, rivals the size of an austere
> processor data path, and only gets used every 2 or every 3 instructions.
> *** Particularly in an implementation fabric that is block RAM port
> constrained ***, it is very tempting to me to share one data cache
> between 2 or 3 PEs.

I fully agree; however, this works just as well for a VLIW. On a
10-issue VLIW you may have just 2 slots for load/store (and none of the
overhead for sharing).

> The implementation cost of the sharing is some
> muxing and some arbitration logic, and perhaps some address-space-ID
> tags and tag checking. And as I note in the article, once you have paid
> for the muxing and the arbitration logic per PE, maybe you don't have to
> pay for it over and over again as you share additional resources.

Wouldn't you like to be able to use multiple shared resources
independently (say, one PE accessing the multiplier, another the barrel
shifter, and a third the shared memory, all in the same cycle)?
Otherwise you'll be severely limiting the utilization of these expensive
resources! Implementing parallel access to them is much simpler (and
cheaper) when there is a single control source...
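To put a rough number on that utilization question, here is a small
cycle-level model in C. It is only a sketch under assumed parameters --
four PEs, one multiplier behind a round-robin arbiter, and the 1-in-10
multiply rate from your example above -- not a model of anyone's actual
design:

#include <stdio.h>
#include <stdlib.h>

#define N_PES     4
#define CYCLES    10000
#define MUL_RATE  10    /* assumption: a PE wants a multiply ~1 cycle in 10 */

int main(void)
{
    int want[N_PES] = {0};   /* pending multiply request per PE */
    int grant = 0;           /* round-robin pointer */
    long served = 0, stalled = 0;

    for (int cyc = 0; cyc < CYCLES; cyc++) {
        /* Each PE without a pending request issues a new multiply
           with probability 1/MUL_RATE. */
        for (int p = 0; p < N_PES; p++)
            if (!want[p] && rand() % MUL_RATE == 0)
                want[p] = 1;

        /* Arbiter: grant the first requester at or after 'grant'.
           This selection is the muxing/arbitration being discussed. */
        for (int i = 0; i < N_PES; i++) {
            int p = (grant + i) % N_PES;
            if (want[p]) {
                want[p] = 0;
                grant = (p + 1) % N_PES;
                served++;
                break;
            }
        }

        /* PEs still waiting lose this cycle. */
        for (int p = 0; p < N_PES; p++)
            stalled += want[p];
    }

    printf("multiplies served: %ld, PE-cycles lost waiting: %ld\n",
           served, stalled);
    return 0;
}

At these rates contention should be modest, which favors sharing a
single multiplier; the interesting case is extending the model to
several shared units per cluster, which is exactly where independent
access (or a single VLIW control source) starts to matter.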
> In any event, my goal was to convey the wide applicability of the
> concept, and how deeply it may lead you to rethink architecture in a
> multiprocessor -- not to specifically champion this or that resource as
> something that must be shared.

And I'm trying to make you see control as yet another resource that can,
and often should, be shared -- to the point where it obviates the need
for sharing most other resources.

> 1) I don't think VLIWs are the best fit technology-mapping and
>    area-wise for an FPGA implementation. That is, I believe, for
>    example, that a three issue LIW will be less efficient (MIPS/LUT)
>    than three instances of a single issue RISC, even though the latter
>    instances incur three sets of instruction fetchers, PC incrementers,
>    etc. I'll try to explain that more, and explore the LIW design
>    space, in a subsequent write-up.

I'm looking forward to rebutting that future write-up ;-). Note that the
particular variety of VLIW that I think has the most potential uses
explicitly programmed bypasses, which leads to a huge reduction in
bypass (i.e. bus/selector) cost for wide configurations (a sketch of
what such an instruction word might encode follows below).

> 2) Past about 3-issue, it seems very hard for a compiler to keep all
>    those issue slots busy with useful work, so they don't scale up
>    enough,

This depends on how much of the parallelism can be extracted as
instruction-level parallelism (ILP). Most applications with abundant
parallelism map very well to 10+ issue widths (VLIW or SIMD/vector), at
least if coded reasonably. The 3-issue ILP limit you mention is for
essentially *sequential* code. For that matter, are you aware of any
compiler that can keep more than 3 _PEs_ busy? (And if you feel it's
okay to manually parallelize code for multiple single-issue PEs, then it
also seems reasonable to manually schedule VLIWs...)

> and then you are back to MPs.
>
> And wider issue VLIWs need a heroic compiler research program, which
> I don't have the resources to chase. (Since all I have is a hammer,
> everything looks like a nail to me.)

Well... What if your PEs have to communicate/synchronize a lot? Many
apps require that to be able to use the parallelism. Now, within a
single instruction stream, synchronization is implicit and communication
is easy and cheap. So with VLIW PEs, you can go a long way by
partitioning for minimum communication and scheduling the operations
within a node. Yes, that's hard, but is it harder than targeting your
massively parallel shared-memory MP? Do you have a compiler which can
target one of your shared-memory clusters (from your regular sequential
HLL source)? Isn't this even _harder_ than scheduling for a VLIW? Aren't
you just accepting that you'll have to program your intra-cluster
communication explicitly, and would it not be fair, then, to accept
manual scheduling for a VLIW PE also?

Your approach is beautiful and simple in concept. However, I'm not
buying your arguments that it is particularly efficient or easy to use
:-). That's true only for apps which map trivially onto that particular
structure, but then the same could be said for VLIWs or any other
programmable structure...

On a final note, there are several VLIW compilers/schedulers more or
less freely available, each with its particular limitations of course (I
don't have up-to-date pointers, but can ask a colleague for them if you
like).
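Here is the promised sketch of the explicitly-programmed-bypass idea, in
C. All names, field widths, and bus ids are illustrative assumptions,
not a real ISA; the point is only that the operand source is named by
the compiler in the instruction word, so the hardware needs selectors
but no destination-tag comparators:

#include <stdint.h>
#include <stdio.h>

/* A source operand: either a register in the slot's local file, or the
   result bus of a function unit -- an explicit, compiler-scheduled
   bypass. */
enum src_kind { SRC_REG, SRC_FU_BUS };

struct operand {
    enum src_kind kind;
    uint8_t       index;   /* register number, or function-unit bus id */
};

/* One issue slot of the wide word. Because any bypass is named in the
   instruction itself, these bits drive the operand selectors directly. */
struct slot {
    uint8_t        opcode;
    struct operand src0, src1;
    uint8_t        dest;   /* writeback register in this slot's local file */
};

/* A 10-wide instruction: each slot reads its own small, distributed
   register file plus a limited set of reachable function-unit buses. */
struct vliw_insn {
    struct slot slots[10];
};

int main(void)
{
    struct vliw_insn insn = {0};

    /* Hypothetical slot 0: r3 <- (multiplier bus 2) + r7. The left
       operand is taken straight off the multiplier's result bus instead
       of being written back and re-read. */
    insn.slots[0] = (struct slot){
        .opcode = 1,                    /* illustrative ADD */
        .src0   = { SRC_FU_BUS, 2 },    /* assumed multiplier bus id */
        .src1   = { SRC_REG, 7 },
        .dest   = 3,
    };

    printf("slot0: op=%d src0(kind=%d,idx=%d) src1(kind=%d,idx=%d) dest=%d\n",
           insn.slots[0].opcode,
           insn.slots[0].src0.kind, insn.slots[0].src0.index,
           insn.slots[0].src1.kind, insn.slots[0].src1.index,
           insn.slots[0].dest);
    return 0;
}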
> I suppose my pet hand-wavy application for these concept chip-MPs is
> lexing and parsing XML and filtering that (and/or parse table
> construction for same) -- see http://www.fpgacpu.org/usenet/re.html.

Very interesting indeed!

Best regards,

- Reinoud