This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).
|
Just reading the latest News: "8-Bit Micro controller for Virtex Devices. If I may be permitted to quote so extensively, I'll let this superb app note speak for itself: "..."What do I consider small? Not 950 or 1100 or 1700 logic cells. Certainly not 3000. By small, I mean cores like this excellent assembler-programmable KCPSM (35 CLBs => ~140 logic cells) or the integer-C-programmable xr16 (~300 logic cells)"

Small is often good, but is the smallest the BEST?

Looking at the common designs: RISC machines and stack machines (FORTH) have the smallest size, being just an ALU with some jump logic on the PC. This is the racing engine of computing: fast but not powerful. CISCs, on the other hand, are overburdened with opcode decoding. A Turing machine has the smallest data path but a very large control section. This is the diesel engine of computing: powerful, but most of the time just idling away. 4- and 8-bit micros are the standby in embedded items. A washing machine does not need the latest 1 GHz CPU. These are the 2-cycle engine that powers your lawnmower. The umm... strange, we don't have a CPU that fits in this category: the just-right CPU, the automobile of computing. The PDP-11 is close, but memory addressing is only 64 KB.

Right now RISC machines are the fastest, needing only a few CLBs for the ALU. The ALU could be limited in function to, say, add, nor, shift left, shift right. Anything we can't do in one cycle we can do in two or three... Memory is fast. Decoding is quick, and the ALU is small. But is memory fast? With external buffering, cache lookups, and bus setup and hold times, the speed of main memory is compromised to, say, 2-3x the access time of a memory element.

Perhaps rather than looking at the number of CLBs one needs to look again at the overall picture. As a crude example: a full-featured ALU with limited shifting could be about 50% more than the minimum needed of 3 or 4 CLBs per bit, say 8 CLBs. Registers like MAR, Input, Output, and byte logic and memory could take, say, 8 more CLBs per bit. For a 16-bit CPU this is 16 x 16 = 256 CLBs. A RISC computer would take 25% of that for control, giving about 320 CLBs for a 16-bit computer. A simple CISC computer would use, say, 100%, with about 500 CLBs for the same 16-bit CPU.

FPGAs are getting faster as dies become smaller, while external memory stays the same speed. This could push designs that require fast main memory to a more CISC style of design. Adding features could push a 16-bit design to, say, 200%, giving 768 CLBs. Now this is getting to be a BIG complex design, yuck! The RISC/Forth design requires too fast a memory. Could not a streamlined CISC be designed to use only about 50% of the CLBs used for the data path and yet give a processor that has a good bit of power?

Ben.
--
"We do not inherit our time on this planet from our parents... We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk |
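The crude estimate in the message above can be tabulated with a short Python sketch. It assumes that each percentage is control overhead on top of the datapath, and it reads the "about 50%" proposal the same way; the names `clb_estimate`, `ALU_CLBS_PER_BIT`, and `REG_CLBS_PER_BIT` are made up for this illustration.

```python
# Crude CLB estimate following the arithmetic in the message.
# Assumption: "control overhead" is a percentage of the datapath CLB count.
ALU_CLBS_PER_BIT = 8   # full-featured ALU with limited shifting
REG_CLBS_PER_BIT = 8   # MAR, input, output, byte logic, memory


def clb_estimate(bits, control_pct):
    """Return (datapath CLBs, total CLBs) for a given control overhead."""
    datapath = bits * (ALU_CLBS_PER_BIT + REG_CLBS_PER_BIT)
    total = datapath * (100 + control_pct) // 100
    return datapath, total


for label, pct in [("RISC", 25), ("streamlined CISC", 50),
                   ("simple CISC", 100), ("feature-rich CISC", 200)]:
    datapath, total = clb_estimate(16, pct)
    print(f"{label:18s} datapath={datapath:4d} total={total:4d}")
```

Under these assumptions the 16-bit totals come out at 320, 384, 512, and 768 CLBs, so the "about 500" figure in the message is a rounded 512.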
|
|
|
--- In , Ben Franchuk <bfranchuk@j...> wrote:
> Just reading the latest News:
> The Risc/Forth design requires too fast a memory. Could not a streamlined
> CISC be designed to use only about 50% of the CLBs used for the data path
> and yet give a processor that has a good bit of power?
> Ben.
> --

I've had similar thoughts, and I started designing a streamlined CISC but dumped it. The tough part is arguing the requirements, and justifying how a CISC design would fill those requirements. The primary motivation for a CISC design is its conservative use of opcode space, i.e. embedded memory resources. I'm not sure what the RISC/CISC ratio for opcode usage is (has anyone done any research on this?). I suspect CISC is not as good as one might think, because a lot can be done with registers in a RISC. It's possible to design a fairly conservative RISC processor.

The control logic for a CISC takes more room than a simple RISC design. That extra room used by the CISC's control logic can be traded for extra memory in a RISC design. A simple RISC design might be easier to debug, and use fewer man-hours in development than a CISC design (although I'm just guessing here).

If you really want to get the most bang for the byte, you can always use a simple bytecode interpreter with a RISC design, perhaps having the whole interpreter in cache (dedicated ROM), like a microcode store :)

I used to be a big fan of CISC designs, then I started studying architectures and have since become a big fan of RISC designs. I might go back and finish that CISC design just for comparison purposes.

PS. Isn't a simple CISC design a RISC processor by definition ? |
|
Rob Finch wrote:
>
> I've had similar thoughts, and I started designing a streamlined CISC
> but dumped it. The tough part is arguing the requirements, and
> justifying how a CISC design would fill those requirements. The
> primary motivation for a CISC design is its conservative use of
> opcode space, i.e. embedded memory resources. I'm not sure what the
> RISC/CISC ratio for opcode usage is (has anyone done any research on
> this?). I suspect CISC is not as good as one might think, because a lot
> can be done with registers in a RISC. It's possible to design a
> fairly conservative RISC processor. The control logic for a CISC
> takes more room than a simple RISC design. That extra room used by
> the CISC's control logic can be traded for extra memory in a RISC
> design. A simple RISC design might be easier to debug, and use fewer
> man-hours in development than a CISC design (although I'm just
> guessing here).

I view RISC machines as micro-coded hardware that uses all of main memory as micro-code.

> If you really want to get the most bang for the byte, you can always
> use a simple bytecode interpreter with a RISC design, perhaps having
> the whole interpreter in cache (dedicated ROM), like a microcode
> store :)

This is the kind of thought that made CISC complex: very good at byte operations, very bad at everything else. The 8086 instruction set comes to mind here. A dedicated fast memory segment is nice, but rarely can you get a small OS nowadays.

> I used to be a big fan of CISC designs, then I started studying
> architectures and have since become a big fan of RISC designs.
> I might go back and finish that CISC design just for comparison
> purposes.
>
> PS. Isn't a simple CISC design a RISC processor by definition ?

Nope, it is a load/store design. The cleanest designs I have seen are still the machines from the early 60's, like the PDP-8 or the PDP-4. I am, I guess, a fan of the OLD IRON...

Ben.
--
"We do not inherit our time on this planet from our parents... We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk |
|
> > PS. Isn't a simple CISC design a RISC processor by definition ?

http://www.cs.uiowa.edu/~jones/arch/cisc/

Do you consider this as RISC ? (Just an example)

But nevertheless I am of the opinion that there are architectures where the RISC/CISC distinction is quite difficult.

For example: Is the good old subtract-and-branch one instruction machine RISC or CISC ?

pro RISC:
- fixed instruction length.
- orthogonal "register" set.. (yes, there is no difference between memory and registers)
- no complex addressing modes.
- fixed data size

pro CISC:
- Multicycle operation. (Though the one instruction could always be broken up into one subtract and one branch instruction) |
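For readers who have not met the subtract-and-branch machine, here is a minimal Python sketch of one common variant, usually called subleq (subtract, then branch if the result is less than or equal to zero). The message does not say which variant is meant, so the three-word instruction format, the negative-address halt convention, and the name `run_subleq` are all assumptions made for this illustration.

```python
def run_subleq(mem, pc=0, max_steps=10_000):
    """Interpret subleq code in place: mem[b] -= mem[a]; branch to c if <= 0.

    Every instruction is three words (a, b, c), so instruction length is
    fixed and there is no separate register file: memory is the only state.
    A negative branch target halts the machine (an assumed convention).
    """
    steps = 0
    while 0 <= pc and pc + 2 < len(mem) and steps < max_steps:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]                      # the "subtract" half
        pc = c if mem[b] <= 0 else pc + 3     # the "branch" half
        steps += 1
    return mem, steps


# Y += X, built from three subleq instructions and a scratch cell Z.
X, Y, Z = 9, 10, 11
program = [
    X, Z, 3,     # Z = -X
    Z, Y, 6,     # Y = Y - Z = Y + X
    Z, Z, -1,    # Z = 0, then halt
    7, 35, 0,    # data: X = 7, Y = 35, Z = 0
]
memory, steps = run_subleq(list(program))
print(memory[Y], steps)
```

Even a plain addition costs three instructions here, each of which reads two operands and writes one back, which is where the multicycle argument on the "pro CISC" side comes from.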
|
Tim Böscke wrote:
>
> > > PS. Isn't a simple CISC design a RISC processor by definition ?
> > >
> http://www.cs.uiowa.edu/~jones/arch/cisc/
>
> Do you consider this as RISC ? (Just an example)

I consider it to be a stack machine... But then I don't teach computer architecture. I consider a CISC machine to be a single-address machine.

> But nevertheless I am of the opinion that there are architectures
> where the RISC/CISC distinction is quite difficult.
>
> For example: Is the good old subtract-and-branch one instruction
> machine RISC or CISC ?

RISC machine - very reduced instruction set :).

> pro RISC:
> - fixed instruction length.
> - orthogonal "register" set.. (yes, there is no difference between memory and registers)
> - no complex addressing modes.
> - fixed data size

Look at the classic computer designs like the PDP-8. The big difference is the load/store design and a multitude of internal registers compared to other designs.

>
> pro CISC:
> - Multicycle operation. (Though the one instruction could always be broken up
> into one subtract and one branch instruction)

The multicycle operation is only because IBM pushed for the 8-bit byte in the 360 computers. This brought the smallest word size down from 12 bits to 8 bits, making any computer using the new format have to be multi-cycle. The 360, being a large machine, could afford to process data in 16- or 32-bit chunks and not be slowed down.

Ben.
--
"We do not inherit our time on this planet from our parents... We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk |
|
> Small is often good but is the smallest the BEST?

If "smallest" delivers on requirements (e.g. fast enough, C programmable, has interrupt handling, or what have you), probably yes.

"A small cat is better than a large cat because it eats less, poops less, and sheds less."
"So it follows that the ideal cat is a cat of zero length?"

As with so many things, the first few resource units provide the essentials. The rest are luxuries. As you climb the luxury curve, each resource spent provides less and less additional value. Sometimes supposed luxuries (like deeper pipelines) make things worse.

If you add up the number of 4-LUTs in a minimal "bare necessities" n-bit processor datapath, for example,

  Cost  What
  n     1 port 16-entry register file
  n     adder/subtractor
  n     logic unit
  0     TBUF-based immediate mux
  0     TBUF-based operand mux
  ---
  3n

you can build a simple streamlined RISC datapath in only 3n logic cells. Maybe even 2n if your ALU operation is "add/nand". If you're willing to multi-cycle it (take k cycles per word) then it's 3n/k or 2n/k.

But it takes a few cycles to execute even one "RISC instruction" like add r3,r1,r2:

  (assume r[0]=0, rPC=1, r[2]=2, bus is 3-state bus, t is temp reg, ir is instruction register)
  ; increment PC and fetch insn
  t = bus <- r[2]
  r[rPC] = mar = bus <- r[rPC] + t
  ir = mem[mar]
  ; add instruction
  t = bus <- r[ir.ra]
  t = bus <- r[ir.rb] + t
  r[ir.rd] = t

If you're only building a toaster SoC, or a toaster channel processor, where 100 kHz frequency would be quite adequate, you might as well build the 3n or 3n/k datapath.

But if that's not fast enough, if you need closer to one instruction per cycle, you must add resources. The first thing you add is a dedicated PC register, PC adder/incrementor, and PC mux. Next you add a second read port to the register file, and perhaps a concurrent write port too. And you add a result multiplexor to select among the various results (add, logic, shifts, load-data-in, return address, etc.):

  Cost   What
  2n-4n  2r1w 16-entry register file
  n      adder/subtractor
  n      logic unit
  0-6n   result multiplexer
  n      PC
  n      PC incrementer
  n      PC mux
  ---
  7n-15n

This is a lot more costly, but is now approximately one instruction per cycle.

If you still need more speed, you'll add pipelining to reduce the cycle time. (But add 2n (or more) for result forwarding muxes for each stage.) Each new pipeline stage you add will reduce the cycle time until the diminishing returns set in, possibly due to the extra interconnect delay incurred by signalling across many result forwarding multiplexers.

If you still need more speed, you'll think about multiple issue, out-of-order, LIW, custom function units, or perhaps multiple processors on chip.

Including control unit overhead, etc., xr16 is about 300 logic cells / 16 bits = ~20n overall, xr32 about ~14n overall.

Jan Gray
Gray Research LLC |
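To make the cycle count of the minimal 3n datapath concrete, the six micro-steps listed in the message can be walked through in a small Python sketch. Several things here are assumptions for illustration only: instructions are stored already decoded as (rd, ra, rb) tuples, memory is a dict keyed by byte address, each micro-step costs exactly one cycle, and the example uses add r5,r3,r4 so that it does not collide with the registers reserved for the PC and the constant 2.

```python
# Sketch of the six-step multi-cycle sequence for one "add rd,ra,rb".
# Assumed for illustration: decoded (rd, ra, rb) tuples in memory,
# 16-bit wraparound, and one cycle per micro-step.
R_PC = 1    # register index that holds the PC, as in the message
R_TWO = 2   # register index that holds the constant 2
MASK = 0xFFFF


def step_add(r, mem):
    """Fetch and execute one add; return the number of cycles spent."""
    cycles = 0
    # increment PC and fetch insn
    t = r[R_TWO]                      # t = bus <- r[2]
    cycles += 1
    mar = (r[R_PC] + t) & MASK        # r[rPC] = mar = bus <- r[rPC] + t
    r[R_PC] = mar
    cycles += 1
    rd, ra, rb = mem[mar]             # ir = mem[mar]
    cycles += 1
    # add instruction
    t = r[ra]                         # t = bus <- r[ir.ra]
    cycles += 1
    t = (r[rb] + t) & MASK            # t = bus <- r[ir.rb] + t
    cycles += 1
    r[rd] = t                         # r[ir.rd] = t
    cycles += 1
    return cycles


regs = [0] * 16
regs[R_TWO] = 2
regs[3], regs[4] = 7, 35
memory = {2: (5, 3, 4)}               # add r5,r3,r4 at byte address 2
cycles = step_add(regs, memory)
print(cycles, regs[5], regs[R_PC])
```

Half of the six cycles go to incrementing the PC and fetching, which is why a dedicated PC register and incrementer are the first resources the message suggests adding.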
|
|
|
Jan Gray wrote:
> If "smallest" delivers on requirements (e.g. fast enough, C programmable,
> has interrupt handling, or what have you), probably yes.

True, except maybe for some unnamed OS's and sales people.

> As with so many things, the first few resource units provide the essentials.
> The rest are luxuries. As you climb the luxury curve, each resource spent
> provides less and less additional value. Sometimes supposed luxuries (like
> deeper pipelines) make things worse.

True, but sometimes cutting corners has a big impact on things. While not hardware, I am thinking of how the serial port on the PC is not interrupt driven under DOS.

> If you add up the number of 4-LUTs in a minimal "bare necessities" n-bit
> processor datapath, for example,
>
>   Cost  What
>   n     1 port 16-entry register file
>   n     adder/subtractor
>   n     logic unit
>   0     TBUF-based immediate mux
>   0     TBUF-based operand mux
>   ---
>   3n

Other FPGAs could have slightly different layouts, but still a low value for N.

> you can build a simple streamlined RISC datapath in only 3n logic cells.
> Maybe even 2n if your ALU operation is "add/nand". If you're willing to
> multi-cycle it (take k cycles per word) then it's 3n/k or 2n/k.
>
> But it takes a few cycles to execute even one "RISC instruction" like add
> r3,r1,r2:
>
>   (assume r[0]=0, rPC=1, r[2]=2, bus is 3-state bus, t is temp reg, ir is
>   instruction register)
>   ; increment PC and fetch insn
>   t = bus <- r[2]
>   r[rPC] = mar = bus <- r[rPC] + t
>   ir = mem[mar]
>   ; add instruction
>   t = bus <- r[ir.ra]
>   t = bus <- r[ir.rb] + t
>   r[ir.rd] = t

A different ALU design, like the 2901's, could reduce this to:

  t = ir.rb
  mar = rPC, rPC <- rPC + #2
  ir = mem[mar], ir.ra = ir.ra + t

> But if that's not fast enough, if you need closer to one instruction per
> cycle, you must add resources. The first thing you add is a dedicated PC
> register, PC adder/incrementor, and PC mux. Next you add a second read port
> to the register file, and perhaps a concurrent write port too. And you add
> a result multiplexor to select among the various results (add, logic,
> shifts, load-data-in, return address, etc.):

Hey, I thought RISC was simple? Is this not what complex computers do now?

>   Cost   What
>   2n-4n  2r1w 16-entry register file
>   n      adder/subtractor
>   n      logic unit
>   0-6n   result multiplexer
>   n      PC
>   n      PC incrementer
>   n      PC mux
>   ---
>   7n-15n
>
> This is a lot more costly, but is now approximately one instruction per
> cycle.

True, but that is because we now have a Harvard style machine: one data memory (on the CPU only) and one program memory (main memory).

> If you still need more speed, you'll add pipelining to reduce the cycle
> time. (But add 2n (or more) for result forwarding muxes for each stage.)
> Each new pipeline stage you add will reduce the cycle time until the
> diminishing returns set in, possibly due to the extra interconnect delay
> incurred by signalling across many result forwarding multiplexers.

I agree fully here. Also, the limiting factor in any case is the adder delay time, as that is the biggest delay in the system.

> If you still need more speed, you'll think about multiple issue,
> out-of-order, LIW, custom function units, or perhaps multiple processors on
> chip.

And more gray hair, unless you go bald.

> Including control unit overhead, etc., xr16 is about 300 logic cells / 16
> bits = ~20n overall, xr32 about ~14n overall.
>

The numbers seem in the right ballpark.

Ben.
--
"We do not inherit our time on this planet from our parents... We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk |
|
Jan Gray wrote:
> The License Agreement speaks for itself and takes precedence over anything I
> might write here. That said, one could interpret it to not permit any use
> of the work or any derivative work for any commercial purpose, and further to
> not permit any distribution of a derivative work (except for a modification
> to an excerpt).

Why copyright the CPU? Copyright the BUGS in the CPU, or specific workarounds for hardware limitations. It seems to me the bug fixes and workarounds stay in the code forever and thus could give the longest revenue for a product. :) I like the idea of a split license, but figuring out just what you can copyright takes a bit of thinking.

Ben.
--
"We do not inherit our time on this planet from our parents... We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk |
|
Jan, How small do you think a xr16-opcode-compatible cpu could be if one didn't care about speed? Do you think it could get below 150 logic cells, maybe? And, if I were to do one of these in VHDL would that violate the spirit of your no-commercial-use license? Best Regards, Gary Watson Technical Director Nexsan Technologies, Ltd. Imperial House East Service Road Raynesway Derby DE21 7BF ENGLAND +44 (0) 1332 5 444 33 http://www.nexsan.com |
|
|
|
> How small do you think a xr16-opcode-compatible cpu could be if one didn't
> care about speed? Do you think it could get below 150 logic cells, maybe?

Let me take you on a tour of less through more drastic changes to xr16 to save area. At some point it ceases to be xr16, but retains its character. This is no problem because we are now adept at porting to similar instruction sets, through small changes to our lcc .md file, or to the assembler.

A non-pipelined, no-DMA xr[n] would save (at least) these resources:

  Savings   What
  n LUTs    A forwarding mux
  2n FFs    A, B operand registers
  n LUTs    PC register file
  (n FFs)   PC register
  n FFs     RETAD register
  n FFs     DOUT register
  n FFs     EXIR
  -----
  2n LUTs 4n FFs

Changing the memory interface to Harvard style would eliminate the need to save the next instruction in NEXTIR in the event of a load/store instruction (see 2nd Circuit Cellar article):

  Savings   What
  n LUTs    NEXTIR
  -----
  3n LUTs 4n FFs

Changing the way interrupts work, or cutting them entirely, would eliminate any further need for IRMUX:

  Savings   What
  n LUTs    IRMUX
  -----
  4n LUTs 4n FFs

Changing the instruction set to 2-operand (no r3=r1 op r2, only r1=r1 op r2) could save:

  Savings   What
  n LUTs    2nd copy of register file
  n FFs     "
  -----
  5n LUTs 5n FFs

Move PC into the register file (say r13), so that each instruction needs an ifetch sub-cycle and an execute sub-cycle. Here branch displacements would be added via the IMM mux. All addresses would be available at the ADDSUB output. Savings:

  Savings   What
  n FFs     PC
  n LUTs    PCINCR
  8 LUTs    PCDISP
  n LUTs    ADDRMUX
  -----
  7n+8 LUTs 6n FFs

Using the output register of a block RAM as the instruction register would save:

  Savings   What
  n FFs     IR
  -----
  7n+8 LUTs 7n FFs

Halve the datapath into an 8-bit tall datapath and take 2 sub-sub-cycles per sub-cycle.

  Savings   What
  n/2 LUTs  addsub
  n/2 LUTs  logic
  n/2 FFs   reg file output register
  -----
  8n+8 LUTs 7 1/2 n FFs

For n=16, we save 136 LUTs and 120 FFs, so we have a 4-cycle per instruction RISC in about 165 LUTs => 165 logic cells. So 150 is not out of the question, but you'd need to change the ISA further.

Of course, this exercise is a little strained because we are taking things away from something substantial instead of building something simpler up from nothing.

> And, if I were to do one of these in VHDL would that violate the spirit of
> your no-commercial-use license?

The License Agreement speaks for itself and takes precedence over anything I might write here. That said, one could interpret it to not permit any use of the work or any derivative work for any commercial purpose, and further to not permit any distribution of a derivative work (except for a modification to an excerpt).

So, are third party implementations of xr16 considered derivative works? That depends upon how they are prepared. If one such implementation contains any part of, or is a mere translation from, the XSOC Kit sources, it's probably a derivative work.

I am contemplating cleaving the XSOC/xr16 project in two. Here's the concept:

The first part, the instruction set specifications, tests, and tools (except for the code covered by the lcc license), GR LLC would relicense under some open source license that preserves the integrity of the xr16/xr32 name.

The second part, the implementation -- the schematics, HDL code, and documentation related to that -- GR LLC would continue to license under the XSOC License Agreement.

This action, *if taken*, might help clarify the status of third party implementations based upon the xr specs and tests and would permit non-derivative clean-room implementations to be used without contamination with any XSOC-licensed works. It would also make it easier for you "third parties" to enhance and redistribute changes to the xr tools suite.

If you strongly favor this, please let me know through private email.

Jan Gray
Gray Research LLC |
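The running totals in the message above can be checked with a short Python sketch. It takes the per-step savings from the stated subtotals (the difference between one running total and the next) rather than re-deriving them from the individual rows, and it uses the earlier figure of about 300 logic cells for xr16 as the baseline; the names `STEPS`, `total_savings`, and `XR16_LOGIC_CELLS` are made up for this illustration.

```python
from fractions import Fraction

# Per-step savings as (description, LUTs per n, constant LUTs, FFs per n),
# taken from the differences between the running totals in the message.
STEPS = [
    ("non-pipelined, no DMA",            2, 0, Fraction(4)),
    ("Harvard interface, no NEXTIR",     1, 0, Fraction(0)),
    ("no IRMUX",                         1, 0, Fraction(0)),
    ("2-operand instruction set",        1, 0, Fraction(1)),
    ("PC in the register file",          2, 8, Fraction(1)),
    ("block RAM output register as IR",  0, 0, Fraction(1)),
    ("8-bit-tall datapath",              1, 0, Fraction(1, 2)),
]

XR16_LOGIC_CELLS = 300   # "about 300 logic cells" from the earlier message


def total_savings(n):
    """Return (LUTs saved, FFs saved) for an n-bit datapath."""
    luts = sum(per_n * n + const for _, per_n, const, _ in STEPS)
    ffs = sum(per_n * n for _, _, _, per_n in STEPS)
    return luts, int(ffs)


luts_saved, ffs_saved = total_savings(16)
remaining = XR16_LOGIC_CELLS - luts_saved
print(luts_saved, ffs_saved, remaining)
```

For n=16 this gives the 136 LUTs and 120 FFs quoted in the message and leaves 164 logic cells, in line with the "about 165" estimate.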
|
|
|
Jan, following the suggestion on your web site, I looked at the Xilinx app note Xapp213 which describes their KCPSM microcontroller. It's pretty cool that they fit it in 35 CLBs and made a snazzy assembler for it. The only thing that I'm uneasy about is the fact that it has a 256 instruction (16 bit wide) limit. I'm pretty sure I need a few k for what I want to do. I'm going to send him feedback to ask how much work it would be to make an upscale version of KCPSM...

Best Regards,
Gary Watson
Technical Director
Nexsan Technologies, Ltd.
Imperial House East Service Road Raynesway Derby DE21 7BF ENGLAND
+44 (0) 1332 5 444 33
http://www.nexsan.com

-----Original Message-----
From: Jan Gray [mailto:]
Sent: Saturday, October 07, 2000 7:00 PM
To: fpga-cpu
Subject: RE: [fpga-cpu] Just what is small and is it the best?

> How small do you think a xr16-opcode-compatible cpu could be if one didn't
> care about speed? Do you think it could get below 150 logic cells, maybe?

Let me take you on a tour of less through more drastic changes to xr16 to save area. At some point it ceases to be xr16, but retains its character. This is no problem because we are now adept at porting to similar instruction sets, through small changes to our lcc .md file, or to the assembler.

[ excellent discussion condensed ] |