This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).
As I understand it, a superscalar architecture is an attempt to exploit ILP to boost performance. A superscalar architecture can issue about four or five instructions at once (is this right)? Wouldn't it be possible to obtain much the same performance boost by simply using a simpler (single-issue) processor and clocking it four or five times faster?

The reason I ask is I'm wondering about the difference between cache performance and processor core performance. Is it reasonable to assume that as time progresses cache performance will lag behind processor performance? Or will these two factors remain reasonably balanced for the foreseeable future? Current architectures are supported by a cache bandwidth set up to match core performance. If core performance outruns possible cache bandwidth, won't the core end up idling anyway? In which case ILP isn't as important?

Rob
http://www.birdcomputer.ca
Josh Fryman wrote:
> in essence, this is what happened between RISC and CISC. RISC
> advocated using a simpler instruction set and more instructions;
> the clock speed could be cranked through the roof while the code
> size was slightly larger. the CISC model was to have HLL constructs
> for things like "FOR" loops and what not; the RISC model was to use
> just the barest instructions possible and a register architecture.

No. Three factors slowed CISCs down:
1) instruction sets geared to save memory
2) legacy formats to be compatible with
3) few usable registers

> RISC won out for a lot of reasons, but mostly because they were right
> about speed and performance.

No - they were wrong -- they get speed because, with the large register set, they only have to fetch instructions. With banked memory you can read instructions real fast.

> the x86 series since the Pentium Pro
> have all taken advantage of this by hardware conversion of CISC to
> "rops", or "RISC ops" inside ... and the instructions are actually
> executed as a series of RISC ones rather than one big complex CISC
> one. i believe even the IA-64 has a small hardware module in it
> that will xlate ia32 CISC into ia64 RISC. the difference is that
> the ia64 has a real RISC instruction set, so you can stop using the
> crappy x86 one.

That is called microcoding - no RISC inside. Microcoding outside the CPU is called RISC.

Ben.
--
Standard Disclaimer : 97% speculation 2% bad grammar 1% facts.
"Pre-historic Cpu's" http://www.jetnet.ab.ca/users/bfranchuk
Now with schematics.
in short, "no". in long, this is a non-trivial question. performance always depends on the underlying application(s) you run. but, to get into specifics while ignoring app-specific details...

> As I understand it, a superscalar architecture is an attempt to
> exploit ILP to boost performance. A superscalar architecture can
> issue about four or five instructions at once (is this right) ?

by definition (H&P) superscalar is the ability to issue more than one instruction per cycle into the function units. that is, you can have several instructions in the various function units, all working simultaneously if not necessarily together - a MUL, an ADD, a couple of LD/ST, etc. the instructions must meet certain dependency and resource constraints to be executed in parallel, and the hardware at run-time makes the decisions on what will be done in parallel. contrast this to VLIW, which is typically compiler-assigned bundles of parallel instructions. of course, there are exceptions to both models, but in general, this is the core idea.

so yes, superscalar and vliw both exploit ILP. the problem is that most code only has an ILP of 1.5-2.5, so going beyond a few execution units doesn't make sense. this suggests that having 3 MUL units would probably be 1 too many; having 2 or 3 LD/ST units is about right, and based on frequency of Fxx ops, FP units are kept to a minimum. note this is different from speculation too.

> Wouldn't it be possible to obtain much the same performance boost by
> simply using a simpler (single issue) processor and clocking it four
> or five times faster ?

in essence, this is what happened between RISC and CISC. RISC advocated using a simpler instruction set and more instructions; the clock speed could be cranked through the roof while the code size was slightly larger. the CISC model was to have HLL constructs for things like "FOR" loops and what not; the RISC model was to use just the barest instructions possible and a register architecture.
RISC won out for a lot of reasons, but mostly because they were right about speed and performance. the x86 series since the Pentium Pro have all taken advantage of this by hardware conversion of CISC to "rops", or "RISC ops" inside ... and the instructions are actually executed as a series of RISC ones rather than one big complex CISC one. i believe even the IA-64 has a small hardware module in it that will xlate ia32 CISC into ia64 RISC. the difference is that the ia64 has a real RISC instruction set, so you can stop using the crappy x86 one.

to make a simpler system, you'd need to simplify the instruction set further. probably just have LD/ST, ROL, ROR, LSL, LSR, OR, AND, BEQ. anything else (like ADD, MUL, etc) would have to be tossed out. you'd then have to implement s/w libraries for these complex instructions. you'd have a system that could scream at 2GHz easily, but would take 30-100 instructions to do a basic "a = x*y;" evaluation, so you would immediately lose the benefit of the increased speed.

to go to single issue isn't a big deal. the need for a high clock is the problem. if you want a high clock, there's very little you can do in any given clock cycle. that means you can only have a limited fanout from one gate to the next before clock skew and other problems show up. at 1GHz, it's very hard to get the clock to move far across a device.

the problem is that you need high speed to compete with current technology. that says you can't have a one-cycle-does-all system because in one cycle at 1GHz you'll have a hard time doing much at all. (thus the very simple bit shift ops and bit logical ops i suggested. anything beyond this is way too complex.) this is why the P4 pipeline is 20 stages. they need 20 stages to get each instruction fully executed. this is also why running non-P4 code on a P4 is a bad performance loss ... each time a stall occurs or a pipeline flush happens, you lose 20 cycles of work. a 1.7GHz processor doesn't cover that loss. recompile for a P4-scheduled app, and you get a nice performance boost rather than a big loss.

> The reason I ask is I'm wondering about the difference between cache
> performance and processor core performance. Is it reasonable to
> assume that as time progresses cache performance will lag behind
> processor performance ?

caches don't generally lag behind processors. main memory is the problem, not the cache. if you're not familiar with H&P, you should read it ... it will give you lots to think about.

anyway, the point is that CPI, or cycles per instruction, is measured in a lot of ways. the issue with memory is easier to analyze if you assume all of your instructions can execute in one cycle. then it looks like this:

    CPI    = (L1-hit-rate * L1-hit-time) + (L1-miss-rate * L2-eff)
    L2-eff = (L2-hit-rate * L2-hit-time) + (L2-miss-rate * L3-eff)
    etc

most L1 caches will have a reasonable (85-90%) hit rate. the trick of the L1 cache is to keep the design simple and sufficiently small (and square in layout) that you can cover the L1 in 1 cycle. this means your CPI isn't governed by L1, but by the lower memory layers. L2 access can vary from 2-5 cycles typically, but will have a similar hit rate of mid- to high-90's percentile. so you can see the problem actually is main memory, which has an access time in the 40-250 cycle range.

the size is further limited by the need for multi-port access to SRAM cache lines, with various stage forwarding requirements and multiple issue problems. the constraint is mostly size and density. you want a fast cache, so it can only be a certain size - depending on manufacturing method.

this is why the fancy systems like CRAY don't use SDRAM or DDR SDRAM or RDRAM or whatever - they use SRAM only, even in main memory. fast SRAM chips will have an access time of just a few ns, bringing down main memory access time phenomenally well. the drawback is the prohibitive cost.
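To put rough numbers on the CPI recurrence given earlier in this reply, here is a minimal C sketch. It is only an illustration: the names `struct level` and `effective_cycles` are made up for this example, and the rates and latencies you feed it are assumptions picked from the ranges quoted above, not measurements of any real machine.

```c
/* One cache level: hit rate in [0,1] and hit time in cycles. */
struct level {
    double hit_rate;
    double hit_time;
};

/*
 * Evaluates the recurrence from the innermost level outward:
 *   eff(i) = hit_rate(i) * hit_time(i) + (1 - hit_rate(i)) * eff(i + 1)
 * Main memory is treated as a final level that always hits and costs
 * mem_time cycles (an assumption made for this sketch).
 */
double effective_cycles(const struct level *levels, int n, double mem_time)
{
    double eff = mem_time;
    for (int i = n - 1; i >= 0; i--) {
        eff = levels[i].hit_rate * levels[i].hit_time
            + (1.0 - levels[i].hit_rate) * eff;
    }
    return eff;
}
```

With an assumed L1 of 90% hits at 1 cycle, an L2 of 95% hits at 5 cycles, and 100-cycle main memory, L2-eff works out to 9.75 cycles and the overall figure to 1.875 cycles. Main memory alone contributes 0.5 of that even though only 0.5% of accesses reach it, which is the point being made about where the problem lies.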
> Or will these two factors remain reasonably
> balanced for the foreseeable future? Current architectures are
> supported by a cache bandwidth setup to match core performance. If
> core performance outruns possible cache bandwidth, won't the core end
> up idling anyway ? In which case ILP isn't as important ?

the cache isn't the problem. it's main memory that's the problem. memory technology advances very slowly in relation to cpu. the "bandwidth" is actually the cache line size, which has other problems in critical-word-first or early override and interruptability.

the big advantage of multiple issue is that memory accesses can be pipelined, and you can "cover" for expected stalls. if your code would normally look like a simple loop:

    for (i=0; i<KOUNT; i++) {
        c = x[i] * k;
        ... do some stuff ...
    }

you can unroll the loop to hide the latency of memory accesses and *not* stall simply because you *can* execute more than one instruction at a time. this is why we use multiple issue systems. the drawback is the extreme-ism of things like the P4. until you get your code converted / recompiled, new architectures seem like dogs for performance.

the trick in any architecture is to have enough LD/ST and other functional units that you can "hide" the latency like this. that's always a big problem to evaluate properly. even system simulations like SimpleScalar and others suck when compared to real hardware.

see "Computer Architecture: A Quantitative Approach" by John L. Hennessy and David A. Patterson, 2nd Ed. (the H&P reference.) this is the "bible" of computer architecture, although it doesn't get too much into parallel systems. that's covered by other books. you probably want to focus on Ch's 3-5.

-josh
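The loop unrolling mentioned in the reply above might look roughly like this in C. This is a sketch under stated assumptions: the function names `scale_rolled` and `scale_unrolled4` are hypothetical, the "do some stuff" body is reduced to a single multiply stored into an output array, the unroll factor of 4 is arbitrary, and `restrict` encodes the assumption that the two arrays do not overlap.

```c
#include <stddef.h>

/* Plain loop: each iteration does one load, one multiply, one store. */
void scale_rolled(int *restrict c, const int *restrict x, int k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = x[i] * k;
}

/*
 * Unrolled by 4: the four loads are independent of each other, so a
 * multiple-issue core with enough LD/ST units can have them in flight
 * together instead of waiting on each one in turn.
 */
void scale_unrolled4(int *restrict c, const int *restrict x, int k, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        int x0 = x[i];
        int x1 = x[i + 1];
        int x2 = x[i + 2];
        int x3 = x[i + 3];
        c[i]     = x0 * k;
        c[i + 1] = x1 * k;
        c[i + 2] = x2 * k;
        c[i + 3] = x3 * k;
    }
    /* Cleanup loop for the leftover 0-3 elements. */
    for (; i < n; i++)
        c[i] = x[i] * k;
}
```

Both functions compute the same result; only the instruction schedule offered to the hardware differs. Whether the unrolled form is actually faster depends on the target, and many compilers will perform this transformation themselves at higher optimization levels.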
> No 3 factors that slowed CISC's down.
> 1) instruction sets geared to save memory

debatable, but i'll certainly accept it was a factor. #1 is what i'd question.

> 2) legacy formats to be compatible with

*every* ISA has this problem. sparc, mips, etc. intel has perhaps carried it to an extreme, but they have also shown that market forces can make a success out of a bad design, too.

> 3) Few usable registers

don't know if i'd agree with this. the MC68000 series certainly didn't have "few" registers to work with. (compared to the intel system, it had a bountiful harvest of them.) compared to a Sparc or MIPS, it didn't have so many. but it was nice to work with anyway. *some* systems definitely had a limited register set that really hampered them.

> No - they were wrong -- they get speed because with the large
> register set as they only have to fetch instructions.With banked memory
> you can read instructions real fast.

this is a point of debate. it's a debate that's been going on a long time. i guess it shows where i stand on it ;)

the whole point of RISC was to use a larger register set and then always work from registers with simple, fast instructions. the idea of CISC was never to target something like that. CISC was meant to serve high level languages - one instruction for one programming construct (for loop, array reference, etc). the hybrids that came later are different stories, but clung to the same model in various forms.

thus, i argue that the RISC people were right. simple instructions execute very fast. using only registers (no direct memory-using ops except ld/st) is also a big performance boost. all the addressing modes and complex instruction encoding formats, as well as complex instructions, limited CISC to always be a much slower architecture. this is why the pentium was never clocked very high; they had to redesign the core with the PPro to move to a RISC core (see below) to get higher clocks. that's also why all the systems since the PPro have used variations on the PPro core: because they fixed the limitations of the CISC system to a certain extent. the P4 is the first real improvement on the PPro design.

you could also (and probably should) argue that RISC was a natural evolution for the system to go to, with memory becoming more abundant and better manufacturing models being available for things like on-chip caches of noticeable size. the hold-on of CISC is just from market ($$) forces.

> that is called micro coding - no risc inside.
> microcoding outside the cpu is called risc.

i'll try to dig up the references. my understanding is that it *does* use a risc subset inside. all CPUs implement microcoding. it's a long established technique. but with the ppro+ systems, they translate the instructions down to a simpler set of commands higher than microcode. these commands were executed in turn by the microcode systems. intel put out some papers on the topic, iirc. i'll dig 'em up.
--- In fpga-cpu@y..., Josh Fryman <fryman@c...> wrote:
>
> in short, "no".
>
> so yes, superscalar and vliw both exploit ILP. the problem is that
> most code only has an ILP of 1.5-2.5, so going beyond a few execution
>

Part of my point. The ILP is relatively low. So even with a superscalar architecture, it is not going to triple (lucky to double) the performance of the processor.

>
> the ia64 has a real RISC instruction set, so you can stop using the
> crappy x86 one.

I've studied the IA64 ISA and come to the conclusion that it's crappy as well.

> to make a simpler system, you'd need to simplify the instruction set
> further. probably just have LD/ST, ROL, ROR, LSL, LSR, OR, AND, BEQ.
> anything else (like ADD, MUL, etc) would have to be tossed out. you'd
> then have to implement s/w libraries for these complex instructions.
> you'd have a system that could scream at 2GHz easily, but would take
> 30-100 instructions to do a basic "a = x*y;" evaluation, so you

When I was referring to a simple processor, I meant simply not going superscalar, not eliminating basic instructions like ADD. A superscalar processor is extremely complex, containing zillions of comparators, a large register array for renaming, multiple execution units and lots of data paths. I'd guess a single-issue RISC processor would be an order of magnitude smaller. Being so much smaller, shouldn't it be possible to have it scream at 2-3GHz? Or does the size of the core not end up making that much difference to the clock cycle?

>
> to go to single issue isn't a big deal. the need for high clock is
> what is the problem. if you want a high clock, there's very
> little you can do in any given clock cycle. that says that you can
> have a limited fanout from one gate to the next sequence before you
> have clock skew and other problems showing up. at 1GHz, it's very
> hard to get the clock to move far across a device.
>
> the problem is that you need high speed to compete with current
> technology.
>

Part of the gist of my question was with reference to *future* process technology. I think superscalar *is* the way to go with today's technology. I'll have to admit I don't know the technology, so perhaps the difference in processor size does not make the difference my gut tells me it should. But I've read, for instance, that the Compaq Alpha takes two cycles to access the L1 cache, and multiple wide read ports are required to keep up with the processor core. If the L1 cache truly operates at the same speed as the core, then why did they do this? Was this a result of improvements in process technology, and shouldn't the same trend hold in the future? To me it looks like additional complex instructions are being added to ISAs to better use limited cache bandwidth.

>
> caches don't generally lag behind processors.
>
> the trick
> of the L1 cache is to keep the design simple and sufficiently small
> (and square in layout) that you can cover the L1 in 1 cycle.

These two points seem opposed. Why is it such a trick to keep the L1 cache operating in a single cycle if it's just as fast as the processor core?

> to evaluate properly. even system simulations like SimpleScalar and
> others suck when compared to real hardware.

So, is it safe to assume that someone has actually tried comparing a speed-optimized single-issue processor to a superscalar one in the same technology?

>
> see "Computer Architecture A Quantitative Approach" by John L. Hennessy
> and David A. Patterson, 2nd Ed. (the H&P reference.) this is the
> "bible" of computer architecture, although it doesn't get too much into
>

I have read H&P, that's why I have questions. Sorry if I seem a bit argumentative, but I just want to be certain a superscalar architecture will not be made obsolete by future process improvements before I spend a lot of time on one.

Thanks
Rob
http://www.birdcomputer.ca