Custom CPU Designs

Started by Rick C April 16, 2020
In the Forth language group there are occasional discussions of custom processors.  This is mostly because a simple stack processor can be designed to be implemented in an FPGA very easily, taking little resources and running at a good speed.  Such a stack processor is a good target for Forth. 

I know there are many other types of CPU designs which are implemented in FPGAs.  I'm wondering how often these are used by the typical embedded developer?  

-- 

  Rick C.

  -- Get 1,000 miles of free Supercharging
  -- Tesla referral code - https://ts.la/richard11209
Rick C <gnuarm.deletethisbit@gmail.com> wrote:
> In the Forth language group there are occasional discussions of custom
> processors.  This is mostly because a simple stack processor can be
> designed to be implemented in an FPGA very easily, taking little resources
> and running at a good speed.  Such a stack processor is a good target for
> Forth.
>
> I know there are many other types of CPU designs which are implemented in
> FPGAs.  I'm wondering how often these are used by the typical embedded
> developer?
It's not uncommon when you have an FPGA doing some task to need a processor
in there to manage it - for example to handle error conditions or to report
status over some software-friendly interface like I2C, or classical printf
debugging on a UART.

Xilinx provides Microblaze and Intel/Altera provide NIOS, which are a bit
legacy (toolchains aren't that great, etc), although designed to be fairly
small. There's increasing interest in small 32-bit RISC-V cores for this
niche, since the toolchain is improving and the licensing conditions more
liberal (RISC-V itself doesn't design cores, but lots of others - including
ourselves - have made open source cores to the RISC-V specification).

I don't see that Forth has much to speak to this niche, given these cores
are largely offloading work that's fiddly to do in hardware and people just
want to write some (usually) C to mop up the loose ends. I can't see what
Forth would gain them for this.

If you're talking using these processors as a compute overlay for FPGAs, a
different thing entirely, I'd suggest that a stack processor is likely not
very efficient in terms of throughput. Although I don't have a cite to back
that up.

Theo
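A minimal sketch of the kind of housekeeping C described above - polling a
memory-mapped UART for printf-style debug from a soft core. The base
address, register offsets, and status bit are hypothetical placeholders for
illustration, not any vendor's actual BSP definitions.

/* Hypothetical memory-mapped UART on a soft core; addresses are
 * placeholders that a real design would take from its generated BSP. */
#include <stdint.h>

#define UART_BASE     0x80000000u
#define UART_STATUS   (*(volatile uint32_t *)(UART_BASE + 0x0))
#define UART_TXDATA   (*(volatile uint32_t *)(UART_BASE + 0x4))
#define UART_TX_READY 0x1u   /* hypothetical "transmitter can accept a byte" flag */

static void uart_putc(char c)
{
    while (!(UART_STATUS & UART_TX_READY))
        ;                    /* busy-wait until the TX path is free */
    UART_TXDATA = (uint32_t)c;
}

static void uart_puts(const char *s)
{
    while (*s)
        uart_putc(*s++);
}

/* e.g. report an error condition spotted by the surrounding FPGA logic */
void report_error(uint32_t code)
{
    uart_puts("FPGA error: ");
    uart_putc("0123456789ABCDEF"[(code >> 4) & 0xF]);
    uart_putc("0123456789ABCDEF"[code & 0xF]);
    uart_puts("\r\n");
}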
On 16/04/2020 12:59, Theo wrote:
> Rick C <gnuarm.deletethisbit@gmail.com> wrote:
>> In the Forth language group there are occasional discussions of custom
>> processors. This is mostly because a simple stack processor can be
>> designed to be implemented in an FPGA very easily, taking little resources
>> and running at a good speed. Such a stack processor is a good target for
>> Forth.
>>
>> I know there are many other types of CPU designs which are implemented in
>> FPGAs. I'm wondering how often these are used by the typical embedded
>> developer?
>
> It's not uncommon when you have an FPGA doing some task to need a processor
> in there to manage it - for example to handle error conditions or to report
> status over some software-friendly interface like I2C, or classical printf
> debugging on a UART.
>
> Xilinx provides Microblaze and Intel/Altera provide NIOS, which are a bit
> legacy (toolchains aren't that great, etc), although designed to be fairly
> small. There's increasing interest in small 32-bit RISC-V cores for this
> niche, since the toolchain is improving and the licensing conditions more
> liberal (RISC-V itself doesn't design cores, but lots of others - including
> ourselves - have made open source cores to the RISC-V specification).
>
> I don't see that Forth has much to speak to this niche, given these cores
> are largely offloading work that's fiddly to do in hardware and people just
> want to write some (usually) C to mop up the loose ends. I can't see what
> Forth would gain them for this.
>
> If you're talking using these processors as a compute overlay for FPGAs, a
> different thing entirely, I'd suggest that a stack processor is likely not
> very efficient in terms of throughput. Although I don't have a cite to back
> that up.
That matches what I have seen from customers.

Very few people design their own CPUs - it is rarely worth the effort.
(People do it for fun, which is a different matter.) They want
off-the-shelf cores and off-the-shelf tools. And they want off-the-shelf
programmers to work with them.

There was a time when FPGAs were smaller and more limited, when you might
want an absolutely minimal CPU core. Then a stack-based core with a small
decoder would be a good choice, and Forth a good fit. Those days are long
gone. Modern programmable logic is bigger and has architectural features
that are a better fit for "standard" 32-bit RISC cores than older devices.
And the design tools make it easy - you pick your core, your peripherals,
your memories, and your interfaces from the libraries, and let the tools
generate the buses, C header files, and everything else.

I can appreciate that Forth can be a very efficient language, and that
Forth-oriented CPUs can be small and fast. But efficiency of the hardware
is not the only goal of a design. I would expect that Forth and stack
processors are almost exclusively used by developers who have been using
them for the last twenty years already.
On 2020-04-16, Theo <theom+news@chiark.greenend.org.uk> wrote:

> It's not uncommon when you have an FPGA doing some task to need a processor
> in there to manage it - for example to handle error conditions or to report
> status over some software-friendly interface like I2C, or classical printf
> debugging on a UART.
I worked on a project using a NIOS2 CPU core in an Altera FPGA. It was a
complete disaster. The CPU's throughput (running at 60MHz) was lower than
the 40MHz ARM7 we were replacing. The SDRAM controller's bandwidth was so
low it couldn't keep up with 100Mbps full duplex Ethernet.

The SW development environment was incredibly horrible and depended on some
ancient version of Eclipse (and required a license dongle) and a bunch of
bloated tools written by some incompetent contractor. There were Java apps
that generated shell scripts that generated TCL programs that generated C
headers (or something like that). The JTAG debugging interface often locked
up the entire CPU.

After spending a year developing prototype products that would never quite
work, we abandoned the NIOS2 in favor of an NXP Cortex M3.

--
Grant
On Thursday, April 16, 2020 at 6:59:29 AM UTC-4, Theo wrote:
> Rick C <gnuarm.deletethisbit@gmail.com> wrote:
> > In the Forth language group there are occasional discussions of custom
> > processors. This is mostly because a simple stack processor can be
> > designed to be implemented in an FPGA very easily, taking little resources
> > and running at a good speed. Such a stack processor is a good target for
> > Forth.
> >
> > I know there are many other types of CPU designs which are implemented in
> > FPGAs. I'm wondering how often these are used by the typical embedded
> > developer?
>
> It's not uncommon when you have an FPGA doing some task to need a processor
> in there to manage it - for example to handle error conditions or to report
> status over some software-friendly interface like I2C, or classical printf
> debugging on a UART.
>
> Xilinx provides Microblaze and Intel/Altera provide NIOS, which are a bit
> legacy (toolchains aren't that great, etc), although designed to be fairly
> small. There's increasing interest in small 32-bit RISC-V cores for this
> niche, since the toolchain is improving and the licensing conditions more
> liberal (RISC-V itself doesn't design cores, but lots of others - including
> ourselves - have made open source cores to the RISC-V specification).
Someone showed me that RISC-V can be a small solution, but at that size it
isn't fast, so the size/performance tradeoff isn't so great. Stack
processors tend to be very lean and fast even without pipelining. I know my
processor was designed to execute one instruction per clock cycle on all
instructions.
> I don't see that Forth has much to speak to this niche, given these cores
> are largely offloading work that's fiddly to do in hardware and people just
> want to write some (usually) C to mop up the loose ends. I can't see what
> Forth would gain them for this.
Not sure what you mean about Forth and "speaking". Forth is a natural
language for a stack-based processor. Often there is a one-to-one mapping
between processor instructions and language words, as sketched below. It is
also very easy to retarget the compiler when the processor design is
modified.
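To make that one-to-one mapping concrete, here is a minimal sketch of a toy
data-stack machine in C. The opcode set and the SQUARE example are invented
for illustration; they are not taken from any particular stack core.

/* Toy data-stack machine: each Forth word compiles to one instruction. */
#include <stdint.h>
#include <stdio.h>

enum { OP_LIT, OP_DUP, OP_MUL, OP_ADD, OP_RET };

typedef struct { uint8_t op; int32_t arg; } insn_t;

static int32_t run(const insn_t *code)
{
    int32_t stack[32];
    int sp = -1;
    for (;;) {
        insn_t i = *code++;
        switch (i.op) {
        case OP_LIT: stack[++sp] = i.arg;              break;
        case OP_DUP: stack[sp + 1] = stack[sp]; sp++;  break;
        case OP_MUL: stack[sp - 1] *= stack[sp]; sp--; break;
        case OP_ADD: stack[sp - 1] += stack[sp]; sp--; break;
        case OP_RET: return stack[sp];
        }
    }
}

int main(void)
{
    /* Forth:  : SQUARE  DUP * ;   5 SQUARE .
     * Each word below maps to exactly one machine instruction. */
    const insn_t square_5[] = {
        { OP_LIT, 5 },   /* 5   */
        { OP_DUP, 0 },   /* DUP */
        { OP_MUL, 0 },   /* *   */
        { OP_RET, 0 },   /* ;   */
    };
    printf("%d\n", run(square_5));   /* prints 25 */
    return 0;
}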
> If you're talking using these processors as a compute overlay for FPGAs, a
> different thing entirely, I'd suggest that a stack processor is likely not
> very efficient in terms of throughput. Although I don't have a cite to back
> that up.
I recall someone had a very large compendium of soft-core processors, with
size and speed measurements and a calculated figure of merit of something
like IPS/LUT. It was amazing what some designs could achieve. I wish I knew
where to find that now.

--

Rick C.

+ Get 1,000 miles of free Supercharging
+ Tesla referral code - https://ts.la/richard11209
On Thursday, April 16, 2020 at 9:40:55 AM UTC-5, Rick C wrote:
> I recall someone had a very large compendium of soft-core processors, with
> size and speed measurements and a calculated figure of merit of something
> like IPS/LUT. It was amazing what some designs could achieve. I wish I
> knew where to find that now.
> very large compendium of soft core processors

https://opencores.org/projects/up_core_list/summary

Several legacy processors are listed:
https://opencores.org/projects/up_core_list/downloads
uP_core_list_by_style-clone190221.pdf

Also look into MISTer, as it supports several legacy systems. None are
competitive speed-wise with a high-performance uP.

With LUTs costing less than $0.001 each, some soft-core uPs are
inexpensive - free if you have unused LUTs and block RAMs. For debug,
changing block RAM contents is much faster than rerunning the FPGA design.
On Thursday, April 16, 2020 at 3:28:22 PM UTC-4, jim.br...@ieee.org wrote:
> > very large compendium of soft core processors
>
> https://opencores.org/projects/up_core_list/summary
>
> Several legacy processors are listed:
> https://opencores.org/projects/up_core_list/downloads
> uP_core_list_by_style-clone190221.pdf
>
> Also look into MISTer, as it supports several legacy systems. None are
> competitive speed-wise with a high-performance uP.
>
> With LUTs costing less than $0.001 each, some soft-core uPs are
> inexpensive - free if you have unused LUTs and block RAMs. For debug,
> changing block RAM contents is much faster than rerunning the FPGA design.
I don't think cost is very often the issue with a soft core. At least for
me it's about board space and integration with the other FPGA functions.

Looks like I was mistaken about the speed/size of the RISC-V core.
However... it appears to have been hand-optimized, if I am reading this
correctly:

"GRVI is an FPGA-efficient RISC-V RV32I soft processor core, hand
technology mapped and floorplanned for best performance/area"

That means it can't be ported to other families without much effort to
achieve similar results. But still, even assuming it drops off to half
those numbers, it's still a very good design.

Thanks for the link, and also for all the work you did on this list.

--

Rick C.

-- Get 1,000 miles of free Supercharging
-- Tesla referral code - https://ts.la/richard11209
Grant Edwards <invalid@invalid.invalid> wrote:
> On 2020-04-16, Theo <theom+news@chiark.greenend.org.uk> wrote:
>
> > It's not uncommon when you have an FPGA doing some task to need a processor
> > in there to manage it - for example to handle error conditions or to report
> > status over some software-friendly interface like I2C, or classical printf
> > debugging on a UART.
>
> I worked on a project using a NIOS2 CPU core in an Altera FPGA. It
> was a complete disaster. The CPU's throughput (running at 60MHz) was
> lower than the 40MHz ARM7 we were replacing. The SDRAM controller's
> bandwidth was so low it couldn't keep up with 100Mbps full duplex
> Ethernet. The SW development environment was incredibly horrible and
> depended on some ancient version of Eclipse (and required a license
> dongle) and a bunch of bloated tools written by some incompetent
> contractor. There were Java apps that generated shell scripts that
> generated TCL programs that generated C headers (or something like
> that). The JTAG debugging interface often locked up the entire CPU.
That's a pretty good summary of the experience. A few things have improved
slightly:

- If you have the space, it's better to use an on-chip BRAM rather than
  SDRAM, given the SDRAM is often running at ~100MHz x 16 bits, which makes
  even instruction fetch multi-cycle (rough numbers sketched below). DDR is
  better, but the memory controllers are much more complex, and life gets
  easier when you have a cache.

- They've upgraded to a plugin for a modern version of Eclipse rather than a
  fork from 2005, but it's still Eclipse :( I just drive the shell scripts
  directly (although the pile of Make they generate isn't that nice). I've
  never had it need a dongle.

- The JTAG interface is horrible and I've spent way too much time reverse
  engineering it [1] and working around its foibles, in particular the JTAG
  UART, which is broken in several ways (google 'JTAG Atlantic' for some
  workarounds).
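A back-of-envelope check of that SDRAM point, using only the figures
mentioned above (a 16-bit bus at ~100MHz against 32-bit instruction fetches
and 100Mbps full-duplex Ethernet). The numbers are illustrative, not
measurements of any particular system.

/* Rough bandwidth arithmetic behind "instruction fetch becomes multi-cycle". */
#include <stdio.h>

int main(void)
{
    const double sdram_mhz = 100.0;  /* bus clock from the example above */
    const double bus_bits  = 16.0;   /* data bus width */

    /* Peak bandwidth, ignoring CAS latency, refresh, and arbitration. */
    const double peak_MBps = sdram_mhz * 1e6 * bus_bits / 8.0 / 1e6;

    /* A 32-bit fetch needs at least two bus beats on a 16-bit bus. */
    const double beats_per_fetch = 32.0 / bus_bits;

    /* 100 Mbit/s full duplex is roughly 25 MB/s of packet data to move. */
    const double enet_MBps = 2.0 * 100e6 / 8.0 / 1e6;

    printf("peak SDRAM bandwidth : %.0f MB/s\n", peak_MBps);   /* 200 */
    printf("bus beats per fetch  : %.0f\n", beats_per_fetch);  /* 2   */
    printf("Ethernet load        : %.0f MB/s\n", enet_MBps);   /* 25  */
    return 0;
}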
> After spending a year developing prototype products that would never
> quite work, we abandoned the NIOS2 in favor of an NXP Cortex M3.
These days a RISC-V CPU in an FPGA solves a lot of the horrible
proprietary-ness, although you still have to glue the toolchain together
yourself (git clone riscv-gcc). But if you can do the job with a hard MCU, I
can't see why you'd want an FPGA.

Personally I much prefer the ARM cores on FPGAs these days - they're a
Proper CPU that Just Works. And the Cortex-A-class cores can boot Linux,
which makes the software development workflow a lot nicer. Although they
aren't that beefy (a $10000 Stratix 10 has a quad-core A53, which is
roughly a Raspberry Pi 3), it's often hard to get parts with them in, and
the bandwidth between the ARM and the soft logic often isn't very good.

Theo

[1] Did you know the Altera product codenames were based on dragons from
How to Train Your Dragon? Interesting what you find out when running
strace(1) on the binary...
On Thursday, April 16, 2020 at 3:50:10 PM UTC-5, Rick C wrote:
> Looks like I was mistaken about the speed/size of the RISC-V core.
> However... it appears to have been hand-optimized, if I am reading this
> correctly:
>
> "GRVI is an FPGA-efficient RISC-V RV32I soft processor core, hand
> technology mapped and floorplanned for best performance/area"
>
> That means it can't be ported to other families without much effort to
> achieve similar results. But still, even assuming it drops off to half
> those numbers, it's still a very good design.
I'm currently showing 36+ distinct RISC-V cores, and there are probably
many more: it's a popular item at many universities. Some of them are
optimized for low LUT count. See https://riscv.org/risc-v-cores/ for a list
of FPGA and non-FPGA cores.

GRVI was done by Jan Gray. He is an expert at keeping LUT count low. It is
not open source, though?
Rick C <gnuarm.deletethisbit@gmail.com> wrote:
> Looks like I was mistaken about the speed/size of the RISC-V core.
> However... it appears to have been hand-optimized, if I am reading this
> correctly:
>
> "GRVI is an FPGA-efficient RISC-V RV32I soft processor core, hand
> technology mapped and floorplanned for best performance/area"
>
> That means it can't be ported to other families without much effort to
> achieve similar results. But still, even assuming it drops off to half
> those numbers, it's still a very good design.
This is our Tinsel multithreaded RISC-V core, which is written in a
high-level HDL (BSV) and not hand-mapped:

https://github.com/POETSII/tinsel
https://github.com/POETSII/tinsel/blob/master/doc/fpl-2019-paper.pdf

To compare:

"Another recent overlay is Gray's GRVI Phalanx [22, 23], a manycore RV32I
fabric supporting message-passing via a Hoplite NoC. Gray reports that a
single 3-stage GRVI core has an Fmax of 375MHz, uses 320 LUTs, and has a
predicted CPI (cycles per instruction) of 1.6. These numbers can be
summarised by a single figure of 0.7 MIPS/LUT. By comparison, a single
16-thread pure RV32I Tinsel core (with tightly-coupled data and instruction
memories) uses 500 ALMs, clocks at 450MHz, and has a predicted CPI of 1
(there are no pipeline hazards due to multithreading), giving a figure of
0.9 MIPS/LUT. This rough comparison assumes a highly-threaded workload, and
involves Fmax and LUT counts taken from different FPGA architectures
(Virtex UltraScale versus Stratix V). Unlike GRVI, Tinsel is not
appropriate for single-threaded workloads. Gray hand-maps a remarkable
1,680 GRVI cores clocking at 250MHz onto a modern, large Xilinx XCVU9P FPGA
using relationally placed macros. However, the hand-mapped approach is
quite fragile, and its effectiveness could be offset when introducing
off-the-shelf IP into the design, e.g. DRAM/SRAM controllers, Ethernet
MACs, FPUs, or custom accelerators, all of which are likely to reduce
regularity. Off-chip memory access, inter-FPGA communication, and
floating-point are left for future work. Gray also cites high-level
programming support as an important goal for the future, which we have
begun to explore in this paper."

You likely wouldn't want to use it for the kind of management purposes I
described earlier, but it's useful for doing compute on larger FPGAs.

Theo
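For anyone wanting to reproduce the figure of merit quoted above: MIPS/LUT
works out as Fmax / CPI / LUTs. The sketch below simply redoes the paper's
arithmetic, treating ALMs and LUTs as comparable in the same rough way the
paper does; it adds nothing beyond the numbers already quoted.

/* Reproducing the MIPS/LUT figures quoted from the Tinsel paper. */
#include <stdio.h>

static double mips_per_lut(double fmax_mhz, double cpi, double luts)
{
    return fmax_mhz / cpi / luts;
}

int main(void)
{
    printf("GRVI  : %.2f MIPS/LUT\n", mips_per_lut(375.0, 1.6, 320.0)); /* ~0.73 */
    printf("Tinsel: %.2f MIPS/LUT\n", mips_per_lut(450.0, 1.0, 500.0)); /*  0.90 */
    return 0;
}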