Multiprocessors, Jan's Razor, resource sharing, and all that

Started by Jan Gray March 5, 2002
[From today's entry at FPGA CPU News]

Multiprocessors, Jan's Razor, resource sharing, and all that

Continuing our multiprocessor (MP) discussion of Saturday.

Today's theme: Compared to uniprocessor design, building a large-N
(N>20) multiprocessor out of compact, austere soft processor core + BRAM
processing elements (PEs) is challenging and ... liberating! Liberating
in the sense that the hard trade-offs are easier to make due to Jan's
Razor (to coin a phrase), the principle that once you achieve a
minimalist instruction set that covers your computation, any additional
functionality per processor is usually unprofitable and therefore
rejected. This drive to leave things out is moderated by the new
opportunity, unique to MPs, to apply resource factoring to share
instances of critical but relatively-infrequently-used resources between
PEs.

Permit me to explain.

Uniprocessor Throughput

The greater part of processor and computer system design is deciding
what features to put in, and what features to leave out. You include a
feature if its benefits outweigh its costs. Usually these costs and
benefits are measured in terms of throughput, latency, area, power,
complexity, and so forth.

Designers typically focus on throughput, that is, the reciprocal of
execution time. In a uniprocessor, the execution time of a computation
is the product of no. of instructions (I), times the average no. of
cycles per instruction (C/I or CPI), times the cycle time (T). You can
reduce the execution time by:

* reducing I: provide more powerful, more expressive instructions
and/or function units that "cover" the computation in fewer
instructions;

* reducing CPI: reduce penalties, interlocks, overheads, and latencies
(branches, loads); or

* reducing T: pipeline deeper, and/or replicate resources instead of
sharing them.

The trick, of course, is to reduce some of I, CPI, or T, without
increasing the other terms! For example, if you add hardware to do
sign-extending load byte instructions (reducing I) you must be careful
not to increase the cycle time T.

Multiprocessor Throughput

But in a multiprocessor running a parallelizable computation, we can
spread the work out over NPE processing elements, so the execution time
equation (in ideal circumstances) approximates

time = I * CPI * T / NPE

So here is how this manifests itself at design time, and where
"liberating" comes into the picture.

Say you have a very austere PE CPU. This processor provides enough
primitive operations to "cover" the integer C operator repertoire, in
one or more instructions. For example, like the GR CPUs, your machine
may provide AND and XOR natively, but implement OR through A OR B = (A
AND B) XOR A XOR B. Or, it may have single-bit shift left (ADD) and
right (SRx), but lack multi-bit shift instructions. Or, it may implement
multiply as a series of adds.

Now, as a designer, imagine you are staring at a benchmark result, and
you see that your benchmark would run 5% faster if your PE had a
hardware (multi-bit) barrel shifter function unit. You know that a
barrel shifter would add (say) 64 LUTs to your 180 LUTs processor. Maybe
you even do a little experiment, and try adding the barrel shifter to
your design, to verify that the larger CPU datapath does not
inadvertently impact cycle time. So, do you add the feature or leave it
out?

In a "pedal to the metal" uniprocessor, you might indeed choose to grow
your processor by 30% to get a 5% speedup. Especially if that change
would help to keep your product ahead of your rival's offering.

But in a chip multiprocessor, where the entire die is tiled with PEs, if
you make your PE 30% larger, this could well reduce NPE by 30%, or more.
(Or it might not. It's a nonlinear function and some other resource,
like BRAM ports, might be the critical constraint.)

When n% Larger PEs Implies n% Fewer PEs

Against this backdrop of "n% larger PEs implies n% fewer PEs", many
"conventional wisdom" must-have features, such as barrel shifters,
multipliers, perhaps even byte load/store alignment logic, are
indefensible.

The problem for our friend the barrel shifter is this. Even if it
dramatically speeds up a bitfield extract operation, from 10 cycles down
to 1 cycle, there just aren't (dynamically) that many shift instructions
in the instruction mix. So the total speed up (as posited above) was
only 5%. Compare the barrel shifter multiplexer LUTs to the other parts
of the CPU. The instruction decoder, the register file, the ALU, the PC
incrementer, the instruction RAM port -- these are used nearly every
darn cycle. Putting these resources in hardware is hardware well spent.
Lots of utilization. But our friend the barrel shifter would only be
used maybe every 20 instructions.

Overall, the multiprocessor will have higher aggregate throughput if you
leave out these occasional-use features and thereby build more PEs per
die. Indulge me to name this principle Jan's Razor.

Jan's Razor: In a chip multiprocessor design, strive to
leave out all but the minimal kernel set of features from
each processing element, so as to maximize processing
elements per die.

Fractional Function Units, Resource Factoring and Sharing

Yet as we design multiprocessor systems-on-chip, we have a new
flexibility, and the barrel shifters of the world get a reprieve. When
designing a uniprocessor, you essentially are performing an integer (as
opposed to linear) programming problem. You must provide some integer 0,
1, 2, etc. ALUs, shifters, memory ports, etc. But in the context of a
multiprocessor, you can provide fractional resources to each processing
element.

In our scenario above, we really wanted to add, say, 1/10 of a barrel
shifter to our processor -- something that ran at full speed but only
for one cycle in every ten, on average, at a cost of 1/10 of the area of
a full barrel shifter. You can't do that in a uniprocessor. You can do
that in a multiprocessor.

Imagine you have a cluster of ten tiny, austere PEs, and assume these
PEs already have some kind of shared access bus, for example for access
to cluster-shared-memory. To the cluster, add a shared barrel shifter,
accessed over this same shared access bus. Now, on average, each PE can
issue a fast shift instruction every tenth instruction, for a cost per
PE of only 1/10th of a barrel shifter.

Continuing with this example, imagine the ten processors are running a
computation kernel that needs to issue a multiply about every five
instructions. Here we add two multipliers to the cluster. Statistically
the multipliers will run hot (fully utilized), and each PE will run
about as fast as if it had a dedicated multiplier, but at a cost of only
about 1/5 of a multiplier per PE.

Here we see this fractional function unit approach also provides a new
dimension of design flexibility. Depending upon the computational
requirements (e.g. the inner loops you find yourself running today), one
may be able to add arbitrary numbers of new function units and
coprocessors to a cluster without disturbing the carefully hand-tuned
processing element tiles themselves.

Rethinking Multiprocessor Architecture

Once you start really thinking about resource factoring and sharing
across PEs in a multiprocessor, it turns a lot of conventional wisdom on
its head, and you look at old problems in new ways.

In particular, you start inventorying the parts of a CPU that are not
used every cycle, and you start asking yourself whether each processing
element deserves an integral or, rather, a fractional instance of that
resource.

Such occasional-use resources include barrel shifter and multiplier of
course, but also load/store byte align hardware, i-cache miss handling
logic, d-cache tag checking, d-cache miss handling, and (most
significantly) an MMU.

Let me address the last few of these. Loads and stores happen every few
cycles. Maybe each PE doesn't deserve more than one third of a data
cache! ... Perhaps it is not unreasonable to have a single shared
d-cache between a small cluster of processors.

The MMU is another example. Even if each PE has its own private i-cache
and d-cache, if these caches are virtually indexed and tagged, then only
on a cache miss do we need to do an MMU address translation and access
validation. If the average processor has a cache miss (say) every twenty
cycles, and if an MMU access can issue each cycle, then it is not
unreasonable to consider sharing a single MMU across ten PEs in a
cluster.

In Summary

It is wrong to design chip multiprocessors by tiling together some
number of instances of die shrunk monolithic uniprocessor cores, each
with a full complement of mostly-idling function units.

A more thoughtful approach, achieving fractional function units by
careful resource sharing, would appear to yield significant PE area
reductions, maximizing overall multiprocessor throughput, and providing
the system designer with a new dimension of design flexibility. Jan Gray
Gray Research LLC



Jan Gray wrote:
>
<snip>
> A more thoughtful approach, achieving fractional function units by
> careful resource sharing, would appear to yield significant PE area
> reductions, maximizing overall multiprocessor throughput, and providing
> the system designer with a new dimension of design flexibility.

Remember too the use of RISC machines over CISC processors was do to
better
effective address calculation and simple integer operations over the
larger
CISC set. A feature like 32 bit shifts was a advantage not because a
person
often shifts say 27 bits but rather 1,2,4 bits up or 1 bit down for
simple
shifts and efa calculations. Also as you get more processors on a chip
memory
bandwidth needs to increase, thus a super fast module may not be needed
if it
has to wait for memory.

--
Ben Franchuk - Dawn * 12/24 bit cpu *
www.jetnet.ab.ca/users/bfranchuk/index.html



On Tue, Mar 05, 2002 at 01:07:03PM -0800, Jan Gray thus spake:
> [From today's entry at FPGA CPU News]
>
> Multiprocessors, Jan's Razor, resource sharing, and all that
>

Very well-thought out. I bet H&P would approve.

This might be of germane interest to the topic:

http://www.colorforth.com/25x.html

Think this embodies what you mean?
(the Forth bias of the chip not withstanding..)

Eric



Jan,

This is a beautiful concept! I hope you don't mind a few arguments
against it. :-)

The limitations you claim for uniprocessor design exist only if you
restrict yourself to scalar processors. For superscalar and VLIW
designs, you have a similar freedom to ratio function unit types.
For example, a typical superscalar design with three integer issue
slots will support multiply on only one of those.

Also, I have to doubt your claim of full utilization of shared
function units (and cores) in a cluster, even if you provide them in
a statistically perfect ratio for some application. One problem is
that the optimal ratio varies over time. Another problem is that in
order to use them fully, their use needs to be perfectly scheduled,
in your case even over multiple independent instruction streams...
That is a particularly hard problem, it's often impossible to get
near full utilization even with just a single instruction stream.

IMHO your argument gets a bit stressed when you suggest that things
like cache and load-store resources might be shared. I think you'll
find that sharing such resources among multiple instruction streams
is both complex and costly, but that isn't even my point. The point
is that there is a breathtakingly simple and obvious way to share all
these resources, and keep the scheduling problem tractable: don't use
multiple instruction streams! Instead, use a single instruction
stream to directly control a nicely ratioed set of function units.
This, of course, is known as VLIW...

Sure, there are limits to useful scaling of VLIW cores, so if you
want to scale further, it makes sense to go multiprocessor or SIMD.
I'm not arguing against multiprocessors in general. But the function
unit ratioing and resource sharing already exist at the uniprocessor
level, where they can be used more efficiently.

Finally, there may of course be applications where the relatively
large amount of control provided by the many instruction streams in
your approach (Sea of Cores - SoC?;) are an advantage. The challenge
will be in finding those applications... Can you think of any?

Regards,

- Reinoud




Renoid, thank you for your superb comments.

> The limitations you claim for uniprocessor design exist only
> if you restrict yourself to scalar processors. For
> superscalar and VLIW designs, you have a similar freedom to
> ratio function unit types.
> For example, a typical superscalar design with three integer
> issue slots will support multiply on only one of those.

You're right, and I should have addressed multiple issue PEs -- but (I
think) so am I. Even in a three issue VLIW, if you only need multiply
one out of every ten instruction issue slots (3 or 4 cycles), maybe you
should be sharing your multiply unit with other instances of your PE.

> Also, I have to doubt your claim of full utilization of
> shared function units (and cores) in a cluster, even if you
> provide them in a statistically perfect ratio for some
> application.

Yes, I agree. My math was sloppy -- it was a concept I was trying to
convey.

> One problem is that the optimal ratio varies
> over time. Another problem is that in order to use them
> fully, their use needs to be perfectly scheduled, in your
> case even over multiple independent instruction streams...
> That is a particularly hard problem, it's often impossible to
> get near full utilization even with just a single instruction stream.

It is not so much the perfect scheduling or utilization I was after as
the illusion that (most of the time) you have a dedicated unit even
though you are sharing it with other PEs. I agree that my description
over promises the concept. Put aside the scheduling notion, that's too
hairy. Using statistics or queuing theory you can model how many shared
resources belong with each cluster so as to bound the expected waiting
time for a shared resource to a certain threshold.

[By the way, I write "you can model" because *I can't* anymore -- that
RAM went unrefreshed for too long. :-)]

> IMHO your argument gets a bit stressed when you suggest that
> things like cache and load-store resources might be shared.
> I think you'll find that sharing such resources among
> multiple instruction streams is both complex and costly, but
> that isn't even my point.

We'll see. :-) To some extent, it comes down to this. Some useful
resource, done right, is too expensive to assign to each PE. Either we
do a stripped down and limited subset of resource that is *just*
affordable per PE, or we leave out the resource, and then *share* an
instance, done right, amongst 3 or 5 or 10 PEs.

I know from staring at it that a proper data cache, that also does byte
and halfword alignment, and so forth, rivals the size of an austere
processor data path, and only gets used every 2 or every 3 instructions.
*** Particularly in an implementation fabric that is block RAM port
constrained ***, it is very tempting to me to share one data cache
between 2 or 3 PEs. The implementation cost of the sharing is some
muxing and some arbitration logic, and perhaps some address-space-ID
tags and tag checking. And as I note in the article, once you have paid
for the muxing and the arbitration logic per PE, maybe you don't have to
pay for it over and over again as you share additional resources.

In any event, my goal was to convey the wide applicability of the
concept, and how deeply it may lead you to rethink architecture in a
multiprocessor -- not to specifically champion this or that resource as
something that must be shared.

I understand and appreciate your skepticism and you may well be right!

> The point is that there is a
> breathtakingly simple and obvious way to share all these
> resources, and keep the scheduling problem tractable: don't
> use multiple instruction streams! Instead, use a single
> instruction stream to directly control a nicely ratioed set
> of function units. This, of course, is known as VLIW...

Good, good push back, thank you very much. I love LIWs (see last half
of http://www.fpgacpu.org/usenet/homebrew.html), and I have a nice
design for a three issue 3x21d-bit instruction word LIW percolating
right here (taps noggin). But I have two problems with them. 1) I
don't think VLIWs are the best fit technology-mapping and area-wise for
an FPGA implementation. That is, I believe, for example, that a three
issue LIW will be less efficient (MIPS/LUT) than three instances of a
single issue RISC, even though the latter instances incur three sets of
instruction fetchers, PC incrementers, etc. I'll try to explain that
more, and explore the LIW design space, in a subsequent write-up. 2)
Past about 3-issue, it seems very hard for a compiler to keep all those
issue slots busy with useful work, so they don't scale up enough, and
then you are back to MPs. And wider issue VLIWs need a heroic compiler
research program, which I don't have the resources to chase. (Since all
I have is a hammer, everything looks like a nail to me.)

I agree that an MP of 2- or 3-issue LIWs merits serious consideration
compared to an MP of scalar RISCs.

> I'm not arguing against multiprocessors in general. But the
> function unit ratioing and resource sharing already exist at
> the uniprocessor level, where they can be used more efficiently.

Thank you again. In my enthusiasm for the concept, I did indeed
overlook the points you make.

> Finally, there may of course be applications where the
> relatively large amount of control provided by the many
> instruction streams in your approach (Sea of Cores - SoC?;)
> are an advantage. The challenge will be in finding those
> applications... Can you think of any?

[First, let me note that most of the time, I too would prefer 1 1000
MIPS processor to 10 200 MIPS processors or 100 50 MIPS processors.
That said ...]

[Read along with me, to the sound of future patents flushing:]

I confess, looking at the V600E 60-way MP I described recently, or its
logical follow ons in V2 and so forth, I confess that these are paper
tigers, with a lot of integer MIPS, in want of an application.
Aggregate "D-MIPS" is not an application!

I suppose my pet hand-wavy application for these concept chip-MPs is
lexing and parsing XML and filtering that (and/or parse table
construction for same) -- see http://www.fpgacpu.org/usenet/re.html).
Let me set the stage for you.

Imagine a future in which "web services" are ubiquitous -- the internet
has evolved into a true distributed operating system, a cloud offering
services to several billion connected devices. Imagine that the current
leading transport candidate for internet RPC, namely SOAP -- (Simple
Object Access Protocol, e.g. XML encoded RPC arguments and return
values, on an HTTP transport, with interfaces described in WSDL (itself
based upon XML Schema)) -- imagine SOAP indeed becomes the standard
internet RPC. That's a *ton* of XML flying around. You will want your
routers and firewalls, etc. of the future to filter, classify, route,
etc. that XML at wire speed. That's a *ton* of ASCII lexing, parsing,
and filtering. It's trivially parallelizable -- every second a thousand
or a million separate HTTP sessions flash past your ports -- and
therefore potentially a nice application for rack full of FPGAs, most
FPGAs implementing a 100-way parsing and classification multiprocessor.

Jan Gray
Gray Research LLC




Jan,

I very much agree that efficient large parallel systems have to
provide the right ratio of resources. I also think that control is
just one of the resources to share, and in many cases has to be
shared among multiple function units for best results. From this
point of view, using multiscalar PEs is just a natural extension of
your proposal...

> You're right, and I should have addressed multiple issue PEs -- but (I
> think) so am I. Even in a three issue VLIW, if you only need multiply
> one out of every ten instruction issue slots (3 or 4 cycles), maybe you
> should be sharing your multiply unit with other instances of your PE.

Yes, or maybe you should use VLIW PEs with an issue width of 10 in
that case. For applications with a large amount of parallelism, PEs
with an issue width of around 10 seems to be a sweet spot (beyond
that width basic block sizes tend to become a problem and cycle times
start to suffer). Also, at that width you can usually avoid the
complexities of sharing resources between PEs (except memory of
course, which presents a rather hairy problem all by itself).

Note that there exist VLIW architectures that scale nicely to 10+
issue widths, by using distributed register files and limited
interconnect/bypassing. Also, note that if the amount of parallelism
available is 'embarassing' enough to keep lots of single-issue PEs
busy, this parallelism usually maps well to (significantly cheaper)
VLIW or even SIMD architectures. For example, consider how many apps
vectorize well.

> We'll see. :-) To some extent, it comes down to this. Some useful
> resource, done right, is too expensive to assign to each PE. Either we
> do a stripped down and limited subset of resource that is *just*
> affordable per PE, or we leave out the resource, and then *share* an
> instance, done right, amongst 3 or 5 or 10 PEs.

But it seems to me that the latter will nearly always be less
efficient than sharing said resource with several others within one
multi-issue PE. The single, centralized control will not only be
simpler and cheaper, but also more efficient because of the
scheduling opportunities.

> I know from staring at it that a proper data cache, that also does byte
> and halfword alignment, and so forth, rivals the size of an austere
> processor data path, and only gets used every 2 or every 3 instructions.
> *** Particularly in an implementation fabric that is block RAM port
> constrained ***, it is very tempting to me to share one data cache
> between 2 or 3 PEs.

I fully agree; however, this works just as well for a VLIW. On a
10-issue VLIW you may have just 2 slots for load-store (and none of
the overhead for sharing).

> The implementation cost of the sharing is some
> muxing and some arbitration logic, and perhaps some address-space-ID
> tags and tag checking. And as I note in the article, once you have paid
> for the muxing and the arbitration logic per PE, maybe you don't have to
> pay for it over and over again as you share additional resources.

Wouldn't you like to be able to use multiple shared resources
independently (say, one PE accesses the multiplier, another the
barrel shifter, and a third shared memory, in the same cycle)?
Otherwise you'll be severely limiting the utilization of these
expensive resources! Implementing parallel access to these is much
simpler (and cheaper) when there is a single control source...

> In any event, my goal was to convey the wide applicability of the
> concept, and how deeply it may lead you to rethink architecture in a
> multiprocessor -- not to specifically champion this or that resource as
> something that must be shared.

And I'm trying to make you see control as yet another resource that
can, and often should, be shared - to the point where it obviates the
need for sharing most other resources.

> 1) I don't think VLIWs are the best fit technology-mapping and
> area-wise for an FPGA implementation. That is, I believe, for
> example, that a three issue LIW will be less efficient (MIPS/LUT)
> than three instances of a single issue RISC, even though the latter
> instances incur three sets of instruction fetchers, PC incrementers,
> etc. I'll try to explain that more, and explore the LIW design
> space, in a subsequent write-up.

I'm looking forward to rebutting that future write-up ;-). Note that
the particular variety of VLIW that I think has the most potential
uses explicitly programmed bypasses, which leads to a huge reduction
in bypass (i.e. bus/selector) cost for wide configurations.

> 2) Past about 3-issue, it seems very hard for a compiler to keep all
> those issue slots busy with useful work, so they don't scale up
> enough,

This depends on how much of the parallelism can be extracted as
instruction-level parallelism (ILP). Most applications with abundant
parallelism map very well to 10+ issue widths (VLIW or SIMD/vector),
at least if coded reasonably. The 3-issue ILP limit you mention is
for essentially *sequential* code.

For that matter, are you aware of any compiler that can keep more
than 3 _PEs_ busy? (And if you feel it's okay to manually
parallelize code for multiple single-issue PEs, then it also seems
reasonable to manually schedule VLIWs...)

> and then you are back to MPs.
>
> And wider issue VLIWs need a heroic compiler research program, which
> I don't have the resources to chase. (Since all I have is a hammer,
> everything looks like a nail to me.)

Well... What if your PEs have to communicate/synchronize a lot?
Many apps require that to be able to use the parallelism. Now,
within a single instruction stream, synchronization is implicit and
communication is easy and cheap. So with VLIW PEs, you can go a long
way by partitioning for minimum communication, and scheduling the
operations within a node. Yes, that's hard, but is it harder than
targetting your massively parallel shared-memory MP? Do you have a
compiler which can target one of your shared-memory clusters (from
your regular sequential HLL source)?. Isn't this even _harder_ than
scheduling for VLIW? Aren't you just accepting that you'll have to
program your intra-cluster communication explicitly, and would it not
be fair, then, to accept manual scheduling for a VLIW PE also?

Your approach is beautiful and simple in concept. However, I'm not
buying your arguments that it is particularly efficient or easy to
use :-). That's true only for apps which map trivially onto that
particular structure, but then the same could be said for VLIWs or
any other programmable structure...

On a final note, there are several VLIW compilers/schedulers more or
less freely available, each with its particular limitations of course
(I don't have up-to-date pointers, but can ask a colleague for them
if you like).

> I suppose my pet hand-wavy application for these concept chip-MPs is
> lexing and parsing XML and filtering that (and/or parse table
> construction for same) -- see http://www.fpgacpu.org/usenet/re.html).

Very interesting indeed!

Best regards,

- Reinoud