This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).

CPU Architectural Question - Author Unknown - Jul 28 7:34:00 2001

As I understand it, a superscalar architecture is an attempt to
exploit ILP to boost performance. A superscalar architecture can
issue about four or five instructions at once (is this right)?
Wouldn't it be possible to obtain much the same performance boost by
simply using a simpler (single issue) processor and clocking it four
or five times faster?

The reason I ask is I'm wondering about the difference between cache
performance and processor core performance. Is it reasonable to
assume that as time progresses cache performance will lag behind
processor performance? Or will these two factors remain reasonably
balanced for the foreseeable future? Current architectures are
supported by a cache bandwidth set up to match core performance. If
core performance outruns possible cache bandwidth, won't the core end
up idling anyway? In which case ILP isn't as important?

Rob http://www.birdcomputer.ca





(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )


Re: CPU Architectural Question - Ben Franchuk - Jul 29 3:37:00 2001

Josh Fryman wrote:

> in essence, this is what happened between RISC and CISC. RISC
> advocated using a simpler instruction set and more instructions;
> the clock speed could be cranked through the roof while the code
> size was slightly larger. the CISC model was to have HLL constructs
> for things like "FOR" loops and what not; the RISC model was to use
> just the barest instructions possible and a register architecture.

No, 3 factors slowed CISCs down:
1) instruction sets geared to save memory
2) legacy formats to be compatible with
3) Few usable registers

> RISC won out for a lot of reasons, but mostly because they were right
> about speed and performance.

No - they were wrong -- they get their speed from the large
register set, since they only have to fetch instructions. With banked memory
you can read instructions real fast.

> the x86 series since the Pentium Pro
> have all taken advantage of this by hardware conversion of CISC to
> "rops", or "RISC ops" inside ... and the instructions are actually
> executed as a series of RISC ones rather than one big complex CISC
> one. i believe even the IA-64 has a small hardware module in it
> that will xlate ia32 CISC into ia64 RISC. the difference is that
> the ia64 has a real RISC instruction set, so you can stop using the
> crappy x86 one.

that is called micro coding - no risc inside.
microcoding outside the cpu is called risc.
Ben.
--
Standard Disclaimer : 97% speculation 2% bad grammar 1% facts.
"Pre-historic Cpu's" http://www.jetnet.ab.ca/users/bfranchuk
Now with schematics.






Re: CPU Architectural Question - Josh Fryman - Jul 29 12:12:00 2001


in short, "no".

in long, this is a non-trivial question. performance always depends
on the underlying application(s) you run. but to get into specifics
while ignoring app-specific details...

> As I understand it, a superscalar architecture is an attempt to
> exploit ILP to boost performance. A superscalar architecture can
> issue about four or five instructions at once (is this right)?

by definition (H&P) superscalar is the ability to issue more than
one instruction per cycle into function units. that is, you can have several
instructions in the various function units, all working simultaneously
if not necessarily together - a MUL, an ADD, a couple of LD/ST, etc.
the instructions must meet certain dependency and resource constraints
to be executed in parallel, and the hardware at run-time makes the
decisions on what will be done in parallel. contrast this to VLIW,
which is typically compiler-assigned bundles of parallel instructions.
of course, there are exceptions to both models, but in general, this
is the core idea.

so yes, superscalar and vliw both exploit ILP. the problem is that
most code only has an ILP of 1.5-2.5, so going beyond a few execution
units doesn't make sense. this suggests that having 3 MUL units would
probably be 1 too many; having 2 or 3 LD/ST units is about right,
and based on frequency of Fxx ops, FP units are kept to a minimum.

note this is different from speculation too.

> Wouldn't it be possible to obtain much the same performance boost by
> simply using a simpler (single issue) processor and clocking it four
> or five times faster?

in essence, this is what happened between RISC and CISC. RISC
advocated using a simpler instruction set and more instructions;
the clock speed could be cranked through the roof while the code
size was slightly larger. the CISC model was to have HLL constructs
for things like "FOR" loops and what not; the RISC model was to use
just the barest instructions possible and a register architecture.

RISC won out for a lot of reasons, but mostly because they were right
about speed and performance. the x86 series since the Pentium Pro
have all taken advantage of this by hardware conversion of CISC to
"rops", or "RISC ops" inside ... and the instructions are actually
executed as a series of RISC ones rather than one big complex CISC
one. i believe even the IA-64 has a small hardware module in it
that will xlate ia32 CISC into ia64 RISC. the difference is that
the ia64 has a real RISC instruction set, so you can stop using the
crappy x86 one.

to make a simpler system, you'd need to simplify the instruction set
further. probably just have LD/ST, ROL, ROR, LSL, LSR, OR, AND, BEQ.
anything else (like ADD, MUL, etc) would have to be tossed out. you'd
then have to implement s/w libraries for these complex instructions.
you'd have a system that could scream at 2GHz easily, but would take
30-100 instructions to do a basic "a = x*y;" evaluation, so you would
immediately lose the benefit of the increased speed.
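
to give a rough feel for where an estimate like 30-100 instructions comes from, here is a minimal C sketch of a multiply built only from shifts, bitwise logic and loop tests. it is an illustration, not a real library: the names soft_add and soft_mul are made up, and it assumes an XOR op is available, which the op list above does not include (AND and OR alone can't build an adder).

```c
#include <stdint.h>

/* add two 32-bit values using only AND, XOR and shift-left.
   XOR is an assumption: the op list in the post does not include it. */
uint32_t soft_add(uint32_t a, uint32_t b)
{
    while (b != 0) {                      /* loop until no carries remain */
        uint32_t carry = (a & b) << 1;    /* bits that carry into the next position */
        a = a ^ b;                        /* sum of each bit pair, ignoring carries */
        b = carry;                        /* carries get added on the next pass */
    }
    return a;
}

/* shift-and-add multiply; the result is truncated to 32 bits */
uint32_t soft_mul(uint32_t x, uint32_t y)
{
    uint32_t product = 0;
    while (y != 0) {
        if (y & 1u)                       /* low bit of y set: add shifted x */
            product = soft_add(product, x);
        x <<= 1;                          /* LSL */
        y >>= 1;                          /* LSR */
    }
    return product;
}
```

each set bit in y costs one full pass through soft_add, and each pass through soft_add is itself a loop, so even small operands turn one MUL into dozens of simple ops.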

to go to single issue isn't a big deal. the need for high clock is
what is the problem. if you want a high clock, there's very
little you can do in any given clock cycle. that means you can only
have a limited fanout from one gate to the next stage before clock
skew and other problems show up. at 1GHz, it's very
hard to get the clock to move far across a device.

the problem is that you need high speed to compete with current
technology. that says you can't have a one-cycle-does-all system
because in one cycle at 1GHz you'll have a hard time doing much
at all. (thus the very simple bit shift ops and bit logical ops
i suggested. anything beyond this is way too complex.)

this is why the P4 pipeline is 20 stages. they need 20 stages to
get each instruction fully executed. this is also why running non-P4
code on a P4 is a bad performance loss ... each time a stall occurs
or a pipeline flush happens, you lose 20 cycles of work. a 1.7GHz
processor doesn't cover that loss. recompile for a P4-scheduled
app, and you get a nice performance boost rather than big loss.
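
a back-of-the-envelope way to see how flushes eat into a high clock is sketched below. the 20-cycle penalty and the 1.7GHz clock come from the paragraph above; the function names and the flush rate in the usage note are invented for illustration, and the model ignores everything except flushes.

```c
#include <assert.h>

/* effective cycles per instruction when some fraction of instructions
   triggers a full pipeline flush */
double effective_cpi(double base_cpi, double flushes_per_instr,
                     double flush_penalty_cycles)
{
    assert(flushes_per_instr >= 0.0 && flushes_per_instr <= 1.0);
    return base_cpi + flushes_per_instr * flush_penalty_cycles;
}

/* instructions retired per second at a given clock rate */
double instr_per_second(double clock_hz, double cpi)
{
    assert(cpi > 0.0);
    return clock_hz / cpi;
}
```

with a base CPI of 1, a 20-cycle penalty and one flush every 50 instructions (a made-up rate), effective_cpi(1.0, 0.02, 20.0) gives 1.4, so a 1.7GHz part retires roughly 1.2 billion instructions a second rather than 1.7 billion.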

> The reason I ask is I'm wondering about the difference between cache
> performance and processor core performance. Is it reasonable to
> assume that as time progresses cache performance will lag behind
> processor performance?

caches don't generally lag behind processors. main memory is the
problem, not the cache. if you're not familiar with H&P, you should
read it ... it will give you lots to think about. anyway, the point
is that CPI, or cycles per instruction, is measured in a lot of ways.

the issue with memory is that if you assume all of your instructions
can execute in one cycle, it makes the analysis easier to understand.
so it looks like this:

CPI = (L1-hit-rate * L1-hit-time) + (L1-miss-rate * L2-eff)
L2-eff = (L2-hit-rate * L2-hit-time) + (L2-miss-rate * L3-eff)
etc

most L1 caches will have a reasonable (85-90%) hit rate. the trick
of the L1 cache is to keep the design simple and sufficiently small
(and square in layout) that you can cover the L1 in 1 cycle. the L1
size is further limited by the need for multi-port access to the SRAM
cache lines, with various stage forwarding requirements and multiple
issue problems. this means your CPI isn't governed by L1, but by the
lower memory layers. L2 access typically takes 2-5 cycles, but will
have a hit rate in the mid- to high-90s percent. so you can see the
problem actually is main memory, which has an access time in the
40-250 cycle range.
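
the formula above can be evaluated level by level, starting from main memory and working back to L1. the sketch below follows the formula exactly as written in this post; the struct and function names are made up for illustration, and the numbers in the usage note are just sample values picked from the ranges quoted here.

```c
#include <assert.h>
#include <stddef.h>

struct mem_level {
    double hit_rate;   /* fraction of accesses that hit at this level, 0..1 */
    double hit_time;   /* access time in cycles on a hit */
};

/* eff(level) = hit_rate * hit_time + miss_rate * eff(next level).
   levels[0] is L1; the last entry should be main memory with hit_rate 1.0. */
double effective_access_cycles(const struct mem_level *levels, size_t n)
{
    double eff = 0.0;
    assert(n > 0);
    for (size_t i = n; i-- > 0; ) {       /* walk from the last level back to L1 */
        assert(levels[i].hit_rate >= 0.0 && levels[i].hit_rate <= 1.0);
        eff = levels[i].hit_rate * levels[i].hit_time
            + (1.0 - levels[i].hit_rate) * eff;
    }
    return eff;
}
```

for a 90% L1 at 1 cycle, a 95% L2 at 4 cycles and main memory at 100 cycles, this works out to about 1.78 cycles per access. nearly half of that comes from the 0.5% of accesses that go all the way to main memory.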

the constraint is mostly size and density. you want a fast cache,
so it can only be a certain size - depending on manufacturing method.
this is why the fancy systems like CRAY don't use SDRAM or DDR SDRAM
or RDRAM or whatever - they use SRAM only, even in main memory. fast
SRAM chips will have an access time of just a few ns, bringing down
main memory access time phenomenally well. the drawback is the
prohibitive cost.

> Or will these two factors remain reasonably
> balanced for the foreseeable future? Current architectures are
> supported by a cache bandwidth set up to match core performance. If
> core performance outruns possible cache bandwidth, won't the core end
> up idling anyway? In which case ILP isn't as important?

the cache isn't the problem. it's main memory that's the problem.
memory technology advances very slowly in relation to cpu. the
"bandwidth" is actually the cache line size, which has other problems
in critical-word-first or early restart and interruptibility.

the big advantage of multiple issue is that memory accesses can be
pipelined, and you can "cover" for expected stalls. if your code
would normally look like a simple loop:

for (i=0; i<KOUNT; i++)
{
    c = x[i] * k;
    ... do some stuff ...
}

you can unroll the loop to hide the latency of memory accesses and
*not* stall simply because you *can* execute more than one instruction
at a time. this is why we use multiple issue systems. the drawback
is the extreme-ism of things like the P4. until you get your code
converted / recompiled, new architectures seem like dogs for performance.
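
as a concrete sketch of the unrolling described above, here is the loop written both ways in C. storing the results into an out array is an assumption made so the example is complete (the original fragment just says "do some stuff"), and the function names are invented. many compilers will do this unrolling on their own at higher optimization levels.

```c
#include <stddef.h>

/* straightforward version: one load and one multiply per iteration */
void scale_simple(const int *x, int *out, size_t count, int k)
{
    for (size_t i = 0; i < count; i++)
        out[i] = x[i] * k;
}

/* unrolled by 4: the four loads do not depend on each other, so a
   multiple-issue core can have several in flight at once instead of
   waiting on each one in turn */
void scale_unrolled4(const int *x, int *out, size_t count, int k)
{
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        int a = x[i];
        int b = x[i + 1];
        int c = x[i + 2];
        int d = x[i + 3];
        out[i]     = a * k;
        out[i + 1] = b * k;
        out[i + 2] = c * k;
        out[i + 3] = d * k;
    }
    for (; i < count; i++)                /* leftover elements */
        out[i] = x[i] * k;
}
```

both functions produce the same results; the unrolled one just exposes more independent work per trip around the loop.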

the trick in any architecture is to have enough LD/ST and other functional
units that you can "hide" the latency like this. that's always a big problem
to evaluate properly. even system simulations like SimpleScalar and
others suck when compared to real hardware.

see "Computer Architecture A Quantitative Approach" by John L. Hennessy
and David A. Patterson, 2nd Ed. (the H&P reference.) this is the
"bible" of computer architecture, although it doesn't get too much into
parallel systems. that's covered by other books. you probably want to
focus on Ch's 3-5.

-josh







Re: CPU Architectural Question - Josh Fryman - Jul 29 20:38:00 2001

> No, 3 factors slowed CISCs down:
> 1) instruction sets geared to save memory

debatable, but i'll certainly accept it was a factor. #1 is
what i'd question.

> 2) legacy formats to be compatible with

*every* ISA has this problem. sparc, mips, etc. intel
has perhaps carried it to an extreme, but they have also
shown that market forces can make a success out of a bad
design, too.

> 3) Few usable registers

don't know if i'd agree with this. the MC68000 series certainly
didn't have "few" registers to work with. (compared to the
intel system, it had a bountiful harvest of them.) compared
to a Sparc or MIPS, it didn't have so many. but it was nice
to work with anyway. *some* systems definitely had a limited
register set that really hampered them.

> No - they were wrong -- they get their speed from the large
> register set, since they only have to fetch instructions. With banked memory
> you can read instructions real fast.

this is a point of debate. it's a debate that's been going on a long time.
i guess it shows where i stand on it ;)

the whole point of RISC was to use a larger register set and then always
work from registers with simple, fast instructions. the idea of CISC was
never to target something like that. CISC was meant to serve high level
languages - one instruction for one programming construct (for loop, array
reference, etc). the hybrids that came later are different stories, but
clung to the same model in various forms.

thus, i argue that the RISC people were right. simple instructions execute
very fast. using only registers (no direct memory using ops except ld/st)
is also a big performance boost. all the addressing modes and complex
instruction encoding formats as well as complex instructions limited CISC
to always be a much slower architecture. this is why the pentium was never
clocked very high; they had to redesign the core with the PPro to move
to RISC core (see below) to get higher clocks. that's also why all the
systems since the PPro have used variations on the PPro core. because
they fixed the limitations of the CISC system to a certain extent. the P4
is the first real improvement on the PPro design.

you could also (and probably should) argue that RISC was a natural evolution
for the system to go to with memory becoming more abundant and better
manufacturing models being available for things like on-chip caches of
noticeable size. the hold-on of CISC is just from market ($$) forces.

> that is called micro coding - no risc inside.
> microcoding ouside the cpu is called risc.

i'll try to dig up the references. my understanding is that it *does*
use a risc subset inside. all CPUs implement microcoding. it's a long
established technique. but with the ppro+ systems, they translate the
instructions down to a simpler set of commands higher than microcode.
these commands were executed in turn by the microcode systems. intel
put out some papers on the topic, iirc. i'll dig 'em up.






Re: CPU Architectural Question - Author Unknown - Jul 30 3:03:00 2001

--- In fpga-cpu@y..., Josh Fryman <fryman@c...> wrote:
>
> in short, "no".
>
> so yes, superscalar and vliw both exploit ILP. the problem is that
> most code only has an ILP of 1.5-2.5, so going beyond a few execution
>
Part of my point. The ILP is relatively low. So even with a
superscalar architecture, it is not going to triple (lucky to double)
the performance of the processor.
>
> the ia64 has a real RISC instruction set, so you can stop using the
> crappy x86 one.

I've studied the IA64 ISA and come to the conclusion that it's crappy
as well.

> to make a simpler system, you'd need to simplify the instruction set
> further. probably just have LD/ST, ROL, ROR, LSL, LSR, OR, AND, BEQ.
> anything else (like ADD, MUL, etc) would have to be tossed out. you'd
> then have to implement s/w libraries for these complex instructions.
> you'd have a system that could scream at 2GHz easily, but would take
> 30-100 instructions to do a basic "a = x*y;" evaluation, so you

When I was referring to a simple processor, I meant simply not going
superscalar, not eliminating basic instructions like ADD. A
superscalar processor is extremely complex, containing zillions of
comparators, a large register array for renaming, multiple execution
units and lots of data paths. I'll guess a single issue RISC
processor would be an order of magnitude smaller. Being so much
smaller, shouldn't it be possible to have it scream at 2-3GHz? Or
does the size of the core not end up making that much difference to
the clock cycle?
>
> to go to single issue isn't a big deal. the need for high clock is
> what is the problem. if you want a high clock, there's very
> little you can do in any given clock cycle. that says that you can
> have a limited fanout from one gate to the next sequence before you
> have clock skew and other problems showing up. at 1GHz, it's very
> hard to get the clock to move far across a device.
>
> the problem is that you need high speed to compete with current
> technology.
>
Part of the gist of my question was with reference to *future*
process technology. I think superscalar *is* the way to go with
today's technology. I'll have to admit I don't know the technology,
so perhaps the difference in processor size does not make the
difference my gut tells me it should. But I've read, for instance,
that the Compaq Alpha takes two cycles to access the L1 cache and
multiple wide read ports are required to keep up with the processor
core. If the L1 cache truly operates at the same speed as the core,
then why did they do this? Was this a result of improvements in
process technology, and shouldn't the same trend hold in the future?
To me it looks like additional complex instructions are being added
to ISAs to better use limited cache bandwidth.
>
> caches don't generally lag behind processors.
>
> the trick
> of the L1 cache is to keep the design simple and sufficiently small
> (and square in layout) that you can cover the L1 in 1 cycle.

These two points seem opposed. Why is it such a trick to keep the L1
cache operating in a single cycle if it's just as fast as the
processor core?

> to evaluate properly. even system simulations like SimpleScalar and
> others suck when compared to real hardware.

So, is it safe to assume that someone has actually tried comparing a
speed-optimized single issue processor to a superscalar one in the
same technology?
>
> see "Computer Architecture A Quantitative Approach" by John L. Hennessy
> and David A. Patterson, 2nd Ed. (the H&P reference.) this is the
> "bible" of computer architecture, although it doesn't get too much into
>
I have read H&P, that's why I have questions.

Sorry if I seem a bit argumentative, but I just want to be certain a
superscalar architecture will not be made obsolete by future process
improvements before I spend a lot of time on one.

Thanks
Rob http://www.birdcomputer.ca



