Jan, I very much agree that efficient large parallel systems have to
provide the right ratio of resources. I also think that control is just
one of the resources to share, and in many cases has to be shared among
multiple function units for best results. From this point of view, using
multiscalar PEs is just a natural extension of your proposal...

> You're right, and I should have addressed multiple issue PEs -- but (I
> think) so am I. Even in a three issue VLIW, if you only need a multiply
> one out of every ten instruction issue slots (3 or 4 cycles), maybe you
> should be sharing your multiply unit with other instances of your PE.

Yes, or maybe you should use VLIW PEs with an issue width of 10 in that
case. For applications with a large amount of parallelism, PEs with an
issue width of around 10 seem to be a sweet spot (beyond that width,
basic block sizes tend to become a problem and cycle times start to
suffer). Also, at that width you can usually avoid the complexities of
sharing resources between PEs (except memory, of course, which presents
a rather hairy problem all by itself). Note that there exist VLIW
architectures that scale nicely to 10+ issue widths by using distributed
register files and limited interconnect/bypassing.

Also, note that if the amount of parallelism available is 'embarrassing'
enough to keep lots of single-issue PEs busy, this parallelism usually
maps well to (significantly cheaper) VLIW or even SIMD architectures.
For example, consider how many apps vectorize well.

> We'll see. :-) To some extent, it comes down to this. Some useful
> resource, done right, is too expensive to assign to each PE. Either we
> do a stripped down and limited subset of resource that is *just*
> affordable per PE, or we leave out the resource, and then *share* an
> instance, done right, amongst 3 or 5 or 10 PEs.

But it seems to me that the latter will nearly always be less efficient
than sharing said resource among several issue slots within one
multi-issue PE. The single, centralized control will not only be simpler
and cheaper, but also more efficient because of the scheduling
opportunities.

> I know from staring at it that a proper data cache, that also does byte
> and halfword alignment, and so forth, rivals the size of an austere
> processor data path, and only gets used every 2 or every 3 instructions.
> *** Particularly in an implementation fabric that is block RAM port
> constrained ***, it is very tempting to me to share one data cache
> between 2 or 3 PEs.

I fully agree; however, this works just as well for a VLIW. On a
10-issue VLIW you may have just 2 slots for load/store (and none of the
overhead for sharing).

> The implementation cost of the sharing is some
> muxing and some arbitration logic, and perhaps some address-space-ID
> tags and tag checking. And as I note in the article, once you have paid
> for the muxing and the arbitration logic per PE, maybe you don't have to
> pay for it over and over again as you share additional resources.

Wouldn't you like to be able to use multiple shared resources
independently (say, one PE accessing the multiplier, another the barrel
shifter, and a third the shared memory, all in the same cycle)?
Otherwise you'll be severely limiting the utilization of these expensive
resources! Implementing parallel access to them is much simpler (and
cheaper) when there is a single control source...
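To put a rough number on that utilization question, here is a small
cycle-level model in C. It is only a sketch under assumed parameters --
four PEs, one multiplier behind a round-robin arbiter, and the 1-in-10
multiply rate from your example above -- not a model of anyone's actual
design:

#include <stdio.h>
#include <stdlib.h>

#define N_PES     4
#define CYCLES    10000
#define MUL_RATE  10    /* assumption: a PE wants a multiply ~1 cycle in 10 */

int main(void)
{
    int want[N_PES] = {0};   /* pending multiply request per PE */
    int grant = 0;           /* round-robin pointer */
    long served = 0, stalled = 0;

    for (int cyc = 0; cyc < CYCLES; cyc++) {
        /* Each PE without a pending request issues a new multiply
           with probability 1/MUL_RATE. */
        for (int p = 0; p < N_PES; p++)
            if (!want[p] && rand() % MUL_RATE == 0)
                want[p] = 1;

        /* Arbiter: grant the first requester at or after 'grant'.
           This selection is the muxing/arbitration being discussed. */
        for (int i = 0; i < N_PES; i++) {
            int p = (grant + i) % N_PES;
            if (want[p]) {
                want[p] = 0;
                grant = (p + 1) % N_PES;
                served++;
                break;
            }
        }

        /* PEs still waiting lose this cycle. */
        for (int p = 0; p < N_PES; p++)
            stalled += want[p];
    }

    printf("multiplies served: %ld, PE-cycles lost waiting: %ld\n",
           served, stalled);
    return 0;
}

At these rates contention should be modest, which favors sharing a
single multiplier; the interesting case is extending the model to
several shared units per cluster, which is exactly where independent
access (or a single VLIW control source) starts to matter.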
> In any event, my goal was to convey the wide applicability of the
> concept, and how deeply it may lead you to rethink architecture in a
> multiprocessor -- not to specifically champion this or that resource as
> something that must be shared.

And I'm trying to make you see control as yet another resource that can,
and often should, be shared -- to the point where it obviates the need
for sharing most other resources.

> 1) I don't think VLIWs are the best fit technology-mapping and
>    area-wise for an FPGA implementation. That is, I believe, for
>    example, that a three issue LIW will be less efficient (MIPS/LUT)
>    than three instances of a single issue RISC, even though the latter
>    instances incur three sets of instruction fetchers, PC incrementers,
>    etc. I'll try to explain that more, and explore the LIW design
>    space, in a subsequent write-up.

I'm looking forward to rebutting that future write-up ;-). Note that the
particular variety of VLIW that I think has the most potential uses
explicitly programmed bypasses, which leads to a huge reduction in
bypass (i.e. bus/selector) cost for wide configurations (a sketch of
what such an instruction word might encode follows below).

> 2) Past about 3-issue, it seems very hard for a compiler to keep all
>    those issue slots busy with useful work, so they don't scale up
>    enough,

This depends on how much of the parallelism can be extracted as
instruction-level parallelism (ILP). Most applications with abundant
parallelism map very well to 10+ issue widths (VLIW or SIMD/vector), at
least if coded reasonably. The 3-issue ILP limit you mention is for
essentially *sequential* code. For that matter, are you aware of any
compiler that can keep more than 3 _PEs_ busy? (And if you feel it's
okay to manually parallelize code for multiple single-issue PEs, then it
also seems reasonable to manually schedule VLIWs...)

> and then you are back to MPs.
>
> And wider issue VLIWs need a heroic compiler research program, which
> I don't have the resources to chase. (Since all I have is a hammer,
> everything looks like a nail to me.)

Well... What if your PEs have to communicate/synchronize a lot? Many
apps require that to be able to use the parallelism. Now, within a
single instruction stream, synchronization is implicit and communication
is easy and cheap. So with VLIW PEs, you can go a long way by
partitioning for minimum communication and scheduling the operations
within a node. Yes, that's hard, but is it harder than targeting your
massively parallel shared-memory MP? Do you have a compiler which can
target one of your shared-memory clusters (from your regular sequential
HLL source)? Isn't this even _harder_ than scheduling for a VLIW? Aren't
you just accepting that you'll have to program your intra-cluster
communication explicitly, and would it not be fair, then, to accept
manual scheduling for a VLIW PE also?

Your approach is beautiful and simple in concept. However, I'm not
buying your arguments that it is particularly efficient or easy to use
:-). That's true only for apps which map trivially onto that particular
structure, but then the same could be said for VLIWs or any other
programmable structure...

On a final note, there are several VLIW compilers/schedulers more or
less freely available, each with its particular limitations of course (I
don't have up-to-date pointers, but can ask a colleague for them if you
like).
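Here is the promised sketch of the explicitly-programmed-bypass idea, in
C. All names, field widths, and bus ids are illustrative assumptions,
not a real ISA; the point is only that the operand source is named by
the compiler in the instruction word, so the hardware needs selectors
but no destination-tag comparators:

#include <stdint.h>
#include <stdio.h>

/* A source operand: either a register in the slot's local file, or the
   result bus of a function unit -- an explicit, compiler-scheduled
   bypass. */
enum src_kind { SRC_REG, SRC_FU_BUS };

struct operand {
    enum src_kind kind;
    uint8_t       index;   /* register number, or function-unit bus id */
};

/* One issue slot of the wide word. Because any bypass is named in the
   instruction itself, these bits drive the operand selectors directly. */
struct slot {
    uint8_t        opcode;
    struct operand src0, src1;
    uint8_t        dest;   /* writeback register in this slot's local file */
};

/* A 10-wide instruction: each slot reads its own small, distributed
   register file plus a limited set of reachable function-unit buses. */
struct vliw_insn {
    struct slot slots[10];
};

int main(void)
{
    struct vliw_insn insn = {0};

    /* Hypothetical slot 0: r3 <- (multiplier bus 2) + r7. The left
       operand is taken straight off the multiplier's result bus instead
       of being written back and re-read. */
    insn.slots[0] = (struct slot){
        .opcode = 1,                    /* illustrative ADD */
        .src0   = { SRC_FU_BUS, 2 },    /* assumed multiplier bus id */
        .src1   = { SRC_REG, 7 },
        .dest   = 3,
    };

    printf("slot0: op=%d src0(kind=%d,idx=%d) src1(kind=%d,idx=%d) dest=%d\n",
           insn.slots[0].opcode,
           insn.slots[0].src0.kind, insn.slots[0].src0.index,
           insn.slots[0].src1.kind, insn.slots[0].src1.index,
           insn.slots[0].dest);
    return 0;
}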
> I suppose my pet hand-wavy application for these concept chip-MPs is
> lexing and parsing XML and filtering that (and/or parse table
> construction for same) -- see http://www.fpgacpu.org/usenet/re.html.

Very interesting indeed!

Best regards,

- Reinoud