This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).

Fwd: Beyond the GFlops per chip - Rattus Norvegicus - Aug 11 17:16:00 2003

(Please excuse my poor English.)

Mainstream processors cannot deliver more than a few GFlops per chip,
yet the technologies needed to build a single-chip supercomputer
already exist. Using them, however, means giving up compatibility
with current platforms. Let's look at the problem.
A 3 GHz CPU capable of up to three FPU operations per clock can
deliver up to 9 GFlops (peak theoretical performance, p.t.p.). With
current instruction set architectures we cannot do much better. Array
operations and a VLIW instruction set can reach higher performance.
If we build a 1 GHz VLIW CPU with a 256-bit instruction word, that
word can contain up to eight 32-bit instructions packed together. If
that CPU has eight array units, each able to work on eight
floating-point numbers packed in a 256-bit register, we reach the
amazing performance of 8 x 8 x 1 GHz = 64 GFlops (p.t.p.), which is
far better than the previous CPU even though the clock speed is only
a third.
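
The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch, not part of the original post; the function name `peak_gflops` and its parameters are made up for the example.

```python
def peak_gflops(clock_ghz, flops_per_cycle, chips=1):
    """Peak theoretical GFlops = clock (GHz) x FP operations per clock x units."""
    return clock_ghz * flops_per_cycle * chips

# Mainstream CPU from the post: 3 GHz, three FPU operations per clock.
mainstream = peak_gflops(3, 3)

# Proposed VLIW CPU: 1 GHz, eight array units, each working on eight
# floats packed in a 256-bit register, so 8 * 8 operations per clock.
vliw = peak_gflops(1, 8 * 8)

print(mainstream, vliw)  # 9 64
```

These are peak figures only; they assume every unit is busy on every clock.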
The drawback of a VLIW CPU is poor code density, which wastes
memory. It also requires a very large L1 cache: the L1 instruction
cache needs to be at least eight times that of a current CPU to hold
the same number of cached instructions.
Instruction set compression can save instruction cache space. Suppose
you have built that CPU, and then created an OS and a working suite
of software as well, and suppose that all the code fits in a 4 GWord
memory (a huge memory of 128 GByte at 256 bits per word). Then no
more than 2^32 different VLIW instructions are used out of the 2^256
possible. This means that the CPU can use a hardwired 32-to-256-bit
instruction decoder. (The set of useful instructions could be
selected by a computer.) This way we get an 8-way VLIW CPU that uses
only 32-bit instructions. This improves code density and requires
less bandwidth for instruction fetch, leaving more bandwidth for
feeding the array units.
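
As a rough software sketch of the compression idea (not from the original post): collect the distinct 256-bit words a program actually uses, and store a short index in place of each word. The names `compress` and `expand` are hypothetical, and a lookup table is used here purely for illustration, where the post proposes hardwired decoder logic.

```python
def compress(program):
    """Replace each wide instruction word by its index in a table of
    the distinct words that occur in the program."""
    table = []       # index -> 256-bit word (the "decoder" contents)
    index_of = {}    # 256-bit word -> index
    compressed = []
    for word in program:
        if word not in index_of:
            index_of[word] = len(table)
            table.append(word)
        compressed.append(index_of[word])
    return table, compressed

def expand(table, compressed):
    """Decode: map every index back to its full-width word."""
    return [table[i] for i in compressed]

# Three made-up 256-bit words, written as Python integers.
A = (1 << 255) | 0x1234
B = (1 << 200) | 0xBEEF
C = 42
program = [A, B, A, C, B]

table, compressed = compress(program)
print(compressed)  # [0, 1, 0, 2, 1]
```

The compressed program needs 32 bits per instruction only as long as the table holds at most 2^32 entries; the table itself still has to be stored or computed somewhere.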
Integrating four CPUs like this on one chip and raising the clock to
4 GHz, we reach a p.t.p. of about 1 TFlops (4 x 64 x 4 GHz = 1024
GFlops)!!
This is the conventional way to get beyond the GFlops scale.
I'm sure there must be at least one other way to reach that speed.
But how?
Many supercomputer applications execute a short loop of instructions
a very large number of times. Hence we can create an 'algorithm unit'
capable of executing the whole inner loop in a single clock cycle. If
the loop is about one hundred instructions long, we reach a
performance of 100 GFlops at 1 GHz. The new idea is to execute a
large number of sequential instructions at the same time in a
programmable pipeline. If we can feed data to this customizable
pipeline at one input per clock, then once the pipeline is full it
completes one loop iteration per clock, so we can say that the unit
executes all the instructions in a single clock. So if the 'algorithm
unit' contains five hundred instructions we can reach a p.t.p. of 1
TFlops at 2 GHz. Obviously we need a CPU that can feed our 'algorithm
unit'.
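
A toy model may make the throughput claim concrete. The sketch below is not from the original post: it treats the loop body as a chain of stages, one operation each, and shifts one input into the chain per clock. The name `run_pipeline` and the three example stages are invented for illustration.

```python
_BUBBLE = object()  # marks an empty pipeline slot

def run_pipeline(stages, inputs):
    """Clock a linear pipeline: one new input per cycle, one operation
    per stage. Returns the results and the number of cycles used."""
    regs = [_BUBBLE] * len(stages)   # output register of each stage
    outputs = []
    cycles = 0
    # Extra bubbles at the end flush the last inputs through.
    stream = list(inputs) + [_BUBBLE] * (len(stages) - 1)
    for x in stream:
        arriving = [x] + regs[:-1]   # value entering each stage this clock
        regs = [v if v is _BUBBLE else f(v)
                for f, v in zip(stages, arriving)]
        if regs[-1] is not _BUBBLE:
            outputs.append(regs[-1])
        cycles += 1
    return outputs, cycles

# A three-instruction "loop body": y = (2*x + 1)^2
stages = [lambda v: v * 2.0, lambda v: v + 1.0, lambda v: v * v]
outputs, cycles = run_pipeline(stages, [1.0, 2.0, 3.0, 4.0])
print(outputs, cycles)  # [9.0, 25.0, 49.0, 81.0] 6
```

Four inputs through three stages take 4 + 3 - 1 = 6 clocks for 12 operations. For a long input stream the fill time becomes negligible and the rate approaches one operation per stage per clock, which is the sense in which a hundred-stage unit at 1 GHz would peak at 100 GFlops.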
The problem now is how to make an 'algorithm unit'. My idea is to
build a grid of processing elements (PEs), each composed of an FP
ALU, four data registers (R0-R3), four I/O links (mapped as registers
R4-R7) and a program register (PR). The PR contains the instruction
that the processing element must execute. Each processing element has
a one-instruction program; no longer program is needed. The program
is a single 32-bit instruction with three fields: the first field is
the test, and the other two are the operations to execute depending
on the result of the test. The instruction looks like:
    IF Ra (=|<|>|!=) Rb THEN Rc = Rd (+|-|*|/|...) Re
                        ELSE Rf = Rg (+|-|*|/|...) Rh
Reading from an input (R4-R7) requires an interlock: the processing
element must wait until the element to which it is connected writes
to the link. Every processing element must take the same time to
execute its instruction. The connected CPU must feed the 'algorithm
unit' and collect the results, so it must be able to read and write
each data register and the program register of each PE in the unit.
A 32x32 toroidal grid (1024 PEs) easily holds a
five-hundred-instruction loop.
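
To see that such an instruction can fit in 32 bits, here is one possible encoding and a one-step interpreter. The bit layout is an assumption made for this sketch, not something the post specifies: 2 bits for the comparison, 3 bits for each of the eight register fields, and 3 bits for each of the two ALU operations (2 + 8*3 + 2*3 = 32). Only four ALU operations are defined, and the I/O interlock on R4-R7 is not modelled.

```python
import operator

# Hypothetical opcode tables (index = field value).
CMP = [operator.eq, operator.lt, operator.gt, operator.ne]           # 2 bits
ALU = [operator.add, operator.sub, operator.mul, operator.truediv]  # 3 bits

def encode(cmp_op, ra, rb, then_part, else_part):
    """Pack IF Ra cmp Rb THEN Rc = Rd op Re ELSE Rf = Rg op Rh.
    then_part and else_part are (alu_op, dst, src1, src2) tuples."""
    word = cmp_op
    for field in (ra, rb, *then_part, *else_part):
        word = (word << 3) | field
    return word

def decode(word):
    """Unpack the ten 3-bit fields and the 2-bit comparison."""
    fields = []
    for _ in range(10):
        fields.append(word & 7)
        word >>= 3
    fields.reverse()
    ra, rb = fields[0], fields[1]
    return word & 3, ra, rb, tuple(fields[2:6]), tuple(fields[6:10])

def step(word, regs):
    """Execute the PE's single instruction once on its 8 registers."""
    cmp_op, ra, rb, then_part, else_part = decode(word)
    taken = then_part if CMP[cmp_op](regs[ra], regs[rb]) else else_part
    alu_op, dst, src1, src2 = taken
    regs[dst] = ALU[alu_op](regs[src1], regs[src2])

# IF R0 > R1 THEN R2 = R0 - R1 ELSE R2 = R1 - R0   (absolute difference)
instr = encode(2, 0, 1, (1, 2, 0, 1), (1, 2, 1, 0))
regs = [5.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
step(instr, regs)
print(regs[2])  # 2.0
```

With this layout the three fields of the post are the 8-bit test (comparison plus two registers) and two 12-bit operations. Other splits are equally plausible.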

Now I'm asking whether someone will take up the quest to design a 1 TFlops CPU...

Surmolotto


(You need to be a member of fpga-cpu -- send a blank email to fpga-cpu-subscribe@yahoogroups.com )


AW: Fwd: Beyond the GFlops per chip - Kolja Sulimma - Aug 22 16:26:00 2003

> so not more
> than 2^32 different VLIW instructions are used of the 2^256
> possible. This mean that the CPU can use an hardwired 32 to
> 256 bit instruction decoder.

This decoder would implement a boolean function with 32 inputs and 256
outputs.
It can be proven that at least half of the functions in that category
cannot be implemented in anything smaller than a 128 GByte ROM
(2^32 entries of 256 bits each).
You need a mapping function that is regular enough to be one of the
very few functions that can be implemented in much less than
exponential size.
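
The ROM size, and the flavour of the counting argument, can be sketched in Python. This is an illustration added for clarity and not necessarily the proof referred to above: it uses the standard observation that there are fewer short bit strings than functions, so most functions have no short description. The function names are made up, and the counting part uses a toy 4-input, 8-output decoder because the real numbers are far too large to compute with.

```python
from fractions import Fraction

def rom_bits(n_inputs, n_outputs):
    """Size in bits of a full lookup table for the decoder."""
    return (2 ** n_inputs) * n_outputs

def max_fraction_describable(table_bits, max_len):
    """Upper bound on the fraction of all functions that can have a
    description of at most max_len bits (each description names at
    most one function)."""
    descriptions = 2 ** (max_len + 1) - 1   # bit strings of length 0..max_len
    functions = 2 ** table_bits             # one function per table content
    return Fraction(descriptions, functions)

full = rom_bits(32, 256)
print(full // 8 // 2**30)  # 128 (GByte, counting 2^30 bytes each)

toy = rom_bits(4, 8)       # 128-bit table
print(max_fraction_describable(toy, toy - 2) < Fraction(1, 2))  # True
```

In the toy case, fewer than half of the functions can be described in even two bits less than the full table, and the same counting holds at any size.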

Regards,

Kolja Sulimma



