
Custom CPU Designs

Started by Rick C April 16, 2020
On 20/04/20 14:58, David Brown wrote:
> On 18/04/2020 21:38, Rick C wrote:
> Yes, especially for the fine details.  But it gets harder again if there can be
> a varying number of clock cycles in events.  You do have the big advantage with
> FPGAs that timing of different parts is usually independent, unlike on MCUs.
>
>> It's an area where MCUs can be very difficult to analyze.
>
> Absolutely.
>
> You could say that XMOS devices and tools are a middle ground here.
Precisely. Maybe Rick will get there, if he bothers to put in the effort :)
On 18/04/20 22:21, Rick C wrote:
> On Saturday, April 18, 2020 at 10:57:55 AM UTC-4, David Brown wrote:
>> (Arguably that's what you have on an FPGA - lots of tiny bits that do
>> very little on their own.  But the key difference is the tools - FPGA's
>> would be a lot less popular if you had to code each LU individually, do
>> placement manually by numbering them, and write a routing file by hand.)
>
> Huh?  Why can't any of that be automated in the many CPU chip?  Certainly
> the issue is not so important with only 144 processors, but the same size
> chip would have many thousands of processors in a more modern technology.
> I think the GA144 is 180 nm.  Bring that down to 15 nm and you have 15,000
> nodes on a chip just 5 mm square!!!  That will require better tools for sure.
>
> Some of your use of language is interesting.  "lots of tiny bits that do
> very little on their own".  That sounds like the way people write software.
> They decompose a large, complex design into lots of tiny routines that
> individually do little on their own.  How can you manage all that
> complexity???  Yes, I think the Zeno paradox must apply so that no large
> program can ever be finished.  That certainly happens sometimes.
On 18/04/20 22:21, Rick C wrote:
 > On Saturday, April 18, 2020 at 10:57:55 AM UTC-4, David Brown wrote:
 >> (Arguably that's what you have on an FPGA - lots of tiny bits that do
 >> very little on their own.  But the key difference is the tools - FPGA's
 >> would be a lot less popular if you had to code each LU individually, do
 >> placement manually by numbering them, and write a routing file by hand.)
 >
 > Huh?  Why can't any of that be automated in the many CPU chip?

Good question; if it is easy, why /hasn't/ it been automated?

The software tools are key; they have to exist and be usable.

If you get that wrong you end up with the Itanium (a.k.a. Itanic)
disaster where the hardware's performance was predicated on the
compilers being sufficiently good - but they never were and
people with decades of experience said they never would be.

Or, in the FPGA world, why aren't Pilkington's "sea of gates"
FPGAs around now?  Because the tools weren't available.

The tools have to be usable; with software (including XMOS),
developers are used to instant compile-download-debug cycles.
FPGA tools can be much slower (arguably an advantage, but that's
a separate discussion!).



 > Some of your use of language is interesting.  "lots of tiny bits
 > that do very little on their own".  That sounds like the way people
 > write software.  They decompose a large, complex design into lots
 > of tiny routines that individually do little on their own.  How
 > can you manage all that complexity???  Yes, I think the Zeno
 > paradox must apply so that no large program can ever be finished.
 > That certainly happens sometimes.

Take that to the limit and you end up with dataflow machines.
They have been around since the early 80s, and there are good
reasons why they have never become popular.

The programming style for the GA144 may well have to end up
being close to that of a dataflow machine. But I've forgotten
the relevant details of the GA144.
On 19/04/2020 23:47, Rick C wrote:

> I agree that the GA144 finds very few sockets when the entire package
> is considered.  But the problem is not because the multiprocessing is
> not an easy way to do multitasking.
I don't want to go through the whole post, because the thread is getting long and time-consuming. (I'm reading all your posts - but practical considerations are against my replying to everything.)

I agree that multiprocessing is a viable technique, and not necessarily harder than other ways of solving problems. However, I don't think it is necessarily easier than multitasking on one (or a few) cores.

I agree that the hardware of this chip is interesting, and has some smart and efficient ideas - and from a purely hardware perspective, it could be a very good choice in a number of types of application.

I believe there are two key issues with the GA144 that outweigh the hardware itself or the learning and alternative thinking needed to make use of a multi-processor design. One is the company - their size, their attitude, their documentation, their support, etc. You've mentioned several of these yourself. The other is the software tools. I haven't tried them, but I've read several of their application notes and information about them. Either the company have done an amazingly bad job of showing them, or they are the biggest pile of **** I've seen this century. And I have seen (and used) some really bad tools over the years.

Even if both these points were fixed - if they had a live and active company that understood customers and their needs, and had realised that most developers have access to a modern computer and don't need a development environment that runs from a floppy disk - there would still be a huge obstacle to overcome: almost no developers use Forth.

Any large enough embedded development department is going to have people with FPGA experience. Finding new employees to build up FPGA competence is not much harder than finding new experts in embedded software - and there are plenty of consulting companies ready to help. But for Forth, how many programmers are there? How many will not have retired in the next 10 years? Of course any programmer can /learn/ Forth. But what company would invest in that?

Even assuming a new toolchain for the GA144 based on modern ANS Forth rather than the weird, experimental (and dead) colorForth, you are talking about a great deal of investment for a very narrow use-case. It is certainly possible that this would be the right decision. But my point is that it would take an outstanding "killer application" to make it worth doing.

I suspect the whole thing would be better off if a new toolset were made with a completely different language - even if that language were invented specifically. Drop Forth (or is that "Forth DROP" ?) and make a modern language at a higher level, with a strong emphasis on multi-processing, data passing and synchronisation (CSP-style, unless someone thinks of anything better), with optimising tools handling the distribution around the chip. Then at least someone using the chip could feel they are investing in the future, rather than in something that views colour screens as a cool new invention.

(I wonder if "Go" would be a good fit? I don't know much about Go, but I do know it has an emphasis on many small, minimal-overhead threads and CSP-style communication.)
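
As a very rough illustration of that CSP style, here is a minimal Go sketch: two tiny "nodes" that share no state and communicate only over channels. The names and the trivial per-stage work are invented purely for the example.

    package main

    import "fmt"

    // producer and doubler are two small workers that own no shared state;
    // they talk only over channels, CSP-style.
    func producer(out chan<- int) {
        for i := 0; i < 5; i++ {
            out <- i
        }
        close(out)
    }

    func doubler(in <-chan int, out chan<- int) {
        for v := range in {
            out <- v * 2 // stand-in for whatever work a real node would do
        }
        close(out)
    }

    func main() {
        a := make(chan int)
        b := make(chan int)
        go producer(a)
        go doubler(a, b)
        for v := range b {
            fmt.Println(v) // prints 0 2 4 6 8, one value per line
        }
    }

An optimising toolchain of the kind described would then be responsible for mapping such workers and channels onto the physical nodes and links of the chip.
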
>> XMOS takes a bit of changed ideas and some new ways of thinking to get
>> the best from them.  But the (virtual) cores are solid devices,
>> programmed in decent modern languages (after some very frustrating
>> limitations in their early days) with a good IDE and tools, and a large
>> community of users well supported by the company.  There are some things
>> about the chips that I think are downright silly limitations, and I
>> think a bit less "purism" would help, but they are a world apart from
>> the GA144.
>
> Yes, too bad the XMOS is the best solution in only a tiny part of the market.
XMOS is a viable company with a solid community and plenty of customers. It's small compared to other microcontroller designs - but it is a world ahead of GA144.
>> Still, there are devices from Efinix (I know almost nothing about the
>> company) and Lattice for a dollar or so.  Yes, these are fine-pitch BGA
>> packages, but the range of companies that can handle these is much
>> greater than it used to be (even if it is by outsourcing the board
>> production).
>
> Lattice has some more "friendly" packages for their low end parts.  As
> soon as I have a need I expect to be using one.  Currently they don't do
> anything better (that I need) than the 10+ year old part I'm currently
> using.
>
> Lattice is always on my radar because they are the only company who seems
> to be taking the low end seriously.
Nice to know. Do you know anything about Efinix?
On 17/04/2020 17:34, Theo wrote:
> Paul Rubin <no.email@nospam.invalid> wrote:
>> Grant Edwards <invalid@invalid.invalid> writes:
>>> Definitely.  The M-class parts are so cheap, there's not much point in
>>> thinking about doing it in an FPGA.
>>
>> Well I think the idea is already you have other stuff in the FPGA, so
>> you save a package and some communications by dropping in a softcore
>> rather than using an external MCU.  I'm surprised that only high end
>> FPGA's currently have hard MCU's already there.  Just like they have DSP
>> blocks, ram blocks, SERDES, etc., they might as well put in some CPU
>> blocks.
>
> I think part of the problem is the ARM licensing cost - if the license cost
> is (random number) 5% of the silicon sticker price that's fine when it's a
> $1 MCU, but when it's a $10000 FPGA that hurts.
I'm not sure that's valid. First, do you know that the ARM licensing costs work that way? Markup is usually a significantly higher percentage at the high end. For a $1 part, the profit will be a few percent. On a $10K part, it will be tens of percent. Yes, paying 5% licensing will be a large number in absolute terms, but you are still left with a bigger lump in the end.

Mind you, economics is not my forte, so I could be getting this completely wrong. (And whatever the numbers, RISC-V changes things significantly.)
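
To put rough (and entirely invented) numbers on that last point, suppose the $10,000 FPGA carries a 40% margin:

    margin before royalty:  0.40 x $10,000 = $4,000
    5% royalty:             0.05 x $10,000 =   $500
    left over:                               $3,500

The royalty is a big cheque in absolute terms, but the vendor still keeps far more per device than it ever could on a $1 part.
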
On 19/04/2020 20:52, Przemek Klosowski wrote:
> On Thu, 16 Apr 2020 17:13:41 -0700, Paul Rubin wrote:
>
>> Grant Edwards <invalid@invalid.invalid> writes:
>>> Definitely.  The M-class parts are so cheap, there's not much point in
>>> thinking about doing it in an FPGA.
>>
>> Well I think the idea is already you have other stuff in the FPGA, so
>> you save a package and some communications by dropping in a softcore
>> rather than using an external MCU.  I'm surprised that only high end
>> FPGA's currently have hard MCU's already there.  Just like they have DSP
>> blocks, ram blocks, SERDES, etc., they might as well put in some CPU
>> blocks.
>
> Maybe Risc-V will catch on.  The design is FOSS, as is the toolchain (GDB
> and LLVM have Risc-V backends already for a while), and the simple
> versions take very few gates.
> https://github.com/SpinalHDL/VexRiscv
> https://hackaday.com/2019/11/19/emulating-risc-v-on-an-fpga/
Has anyone here tried SpinalHDL?  I had a look at it, and it seems very appealing.
On 17/04/2020 18:13, Rick C wrote:
> On Friday, April 17, 2020 at 4:38:04 AM UTC-4, David Brown wrote:
>> On 17/04/2020 06:54, Clifford Heath wrote:
>>> On 17/4/20 1:01 pm, Rick C wrote:
>>>> On Thursday, April 16, 2020 at 10:35:07 PM UTC-4, Clifford
>>>> Heath wrote:
>>>>>
>>>>> Some US language is ancient English (but modern English has
>>>>> moved on), and sometimes its the reverse.
>>>>> "Aluminium/Aluminum" is an example where English moved on (to
>>>>> improve standardisation).
>>>>
>>>> Sorry, can you explain the aluminium/aluminum thing?  I know
>>>> some people pronounce it with an accent (not saying who) but I
>>>> don't get the English moved on thing.
>>>
>>> Aluminum is the original name, which Americans retained when the
>>> English decided to standardise on the -ium extension that was
>>> being used with most other metals already.
>>>
>>> That's my understanding anyhow.
>>>
>>> CH
>>
>> Yes, that is correct (AFAIK).  This is one of the differences
>> between spoken English and spoken American that always annoys me
>> when I hear it - I don't really know why, and of course it is
>> unfair and biased.  The other one that gets me is when Americans
>> pronounce "route" as "rout" instead of "root".  A "rout" is when
>> one army chases another army off the battlefield, or a groove cut
>> into a piece of wood.  It is not something you do with a network
>> packet or pcb track!
>>
>> I'm sure Americans find it equally odd or grating when they hear
>> British people "rooting" pcbs and network packets.
>>
>> :-)
>
> I've seen the word "rooted" used in a much more vulgar sense in many
> British works to think you don't know why that just sounds wrong when
> applied to PWBs.
In British English, /anything/ can be used to sound vulgar! And the word "root" has several established meanings - most of them perfectly decent. (A common one is "support", as in "rooting for a football team".) In Glaswegian, any word can be used as an adjective to mean "drunk". "I got absolutely rooted last night" - anyone from Glasgow will know exactly what you mean.
On 20/04/20 15:53, David Brown wrote:
> Beyond that, you have mostly the same issues.  Deadlock, livelock,
> synchronisation - they are all something you have to consider whether you are
> making an FPGA design, multi-tasking on one cpu, or running independent tasks
> on independent processors.
>
> Task prioritising is an important issue.  But it is not just for multitasking
> on a single cpu.  If you have a high priority task A that sometimes has to wait
> for the results from a low priority task B, you have an issue to deal with.
> That applies whether they are on the same cpu or different ones.  On a single
> cpu, you have the solution of bumping up the priority for task B for a bit
> (priority inheritance) - on different cpus, you just have to wait.
Multiple task priorities is too often used as a sticking plaster
to cure livelock/deadlock problems - or more accurately /appear/
to cure them.

I much prefer to have two priorities: normal and interrupt,
and then to have a supervisor of some sort which specifies
which task runs at any given time.

Any such supervisor is probably application specific, coded
explicitly, and can be inspected/debugged/modified like any
ordinary task.

One I've used in the past is to have a "demultiplexer" which
directs job fragments into different FIFOs, for processing
by worker threads. That was naturally scalable and can be made
"highly available", which is mandatory in telecom applications.
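
A rough Go sketch of that demultiplexer-and-FIFOs arrangement; the job type and the key-based routing rule are invented for the example, and a real system would add bounded queues, monitoring and failover.

    package main

    import (
        "fmt"
        "sync"
    )

    type job struct {
        key  int    // used to pick a FIFO, so related fragments stay in order
        data string
    }

    func main() {
        const nWorkers = 4
        fifos := make([]chan job, nWorkers)
        var wg sync.WaitGroup

        // One worker per FIFO, all running at "normal" priority.
        for i := range fifos {
            fifos[i] = make(chan job, 16)
            wg.Add(1)
            go func(id int, in <-chan job) {
                defer wg.Done()
                for j := range in {
                    fmt.Printf("worker %d handled %s\n", id, j.data)
                }
            }(i, fifos[i])
        }

        // The demultiplexer: direct each job fragment to a FIFO.
        for i := 0; i < 10; i++ {
            j := job{key: i, data: fmt.Sprintf("fragment %d", i)}
            fifos[j.key%nWorkers] <- j
        }
        for _, f := range fifos {
            close(f)
        }
        wg.Wait()
    }
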
On 20/04/2020 18:03, Tom Gardner wrote:
> On 20/04/20 15:53, David Brown wrote:
>> Beyond that, you have mostly the same issues.  Deadlock, livelock,
>> synchronisation - they are all something you have to consider whether
>> you are making an FPGA design, multi-tasking on one cpu, or running
>> independent tasks on independent processors.
>>
>> Task prioritising is an important issue.  But it is not just for
>> multitasking on a single cpu.  If you have a high priority task A that
>> sometimes has to wait for the results from a low priority task B, you
>> have an issue to deal with.  That applies whether they are on the same
>> cpu or different ones.  On a single cpu, you have the solution of
>> bumping up the priority for task B for a bit (priority inheritance) -
>> on different cpus, you just have to wait.
>
> Multiple task priorities is too often used as a sticking plaster
> to cure livelock/deadlock problems - or more accurately /appear/
> to cure them.
>
> I much prefer to have two priorities: normal and interrupt,
> and then to have a supervisor of some sort which specifies
> which task runs at any given time.
>
> Any such supervisor is probably application specific, coded
> explicitly, and can be inspected/debugged/modified like any
> ordinary task.
A better solution, I think, is that it should not matter which task is running at any given time - because these tasks run to handle a specific situation then yield waiting for an event. Multiple priorities are convenient to express which tasks should be handled with lower latencies. I'm not keen on multiple tasks that have some kind of time-sharing switching between them, or round-robin pre-emptive multitasking. (Unless by "supervisor" here you just mean a "while (true)" loop that calls the tasks one after the other, which is fine.)
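
For what it's worth, that "while (true)" supervisor with run-to-completion tasks is only a few lines; here is a toy Go sketch (the tasks and event flags are invented, and on a real MCU this would typically be plain C with flags set from interrupts):

    package main

    import "fmt"

    // A run-to-completion task: when its event is pending it does its work
    // and "yields" simply by returning.
    type task struct {
        name    string
        pending bool
        run     func()
    }

    func main() {
        tasks := []*task{
            {name: "uart", run: func() { fmt.Println("handle uart byte") }},
            {name: "adc", run: func() { fmt.Println("handle adc sample") }},
        }

        // Pretend an event arrived (normally set by an interrupt or I/O).
        tasks[0].pending = true

        // The "while (true)" supervisor: call each ready task in turn.
        // (Bounded here only so the example terminates.)
        for iter := 0; iter < 3; iter++ {
            for _, t := range tasks {
                if t.pending {
                    t.pending = false
                    fmt.Println("running task:", t.name)
                    t.run()
                }
            }
        }
    }
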
> One I've used in the past is to have a "demultiplexer" which
> directs job fragments into different FIFOs, for processing
> by worker threads. That was naturally scalable and can be made
> "highly available", which is mandatory in telecom applications.
On 20/04/20 18:39, David Brown wrote:
> On 20/04/2020 18:03, Tom Gardner wrote:
>> On 20/04/20 15:53, David Brown wrote:
>>> Beyond that, you have mostly the same issues.  Deadlock, livelock,
>>> synchronisation - they are all something you have to consider whether you are
>>> making an FPGA design, multi-tasking on one cpu, or running independent tasks
>>> on independent processors.
>>>
>>> Task prioritising is an important issue.  But it is not just for multitasking
>>> on a single cpu.  If you have a high priority task A that sometimes has to
>>> wait for the results from a low priority task B, you have an issue to deal
>>> with.  That applies whether they are on the same cpu or different ones.  On a
>>> single cpu, you have the solution of bumping up the priority for task B for a
>>> bit (priority inheritance) - on different cpus, you just have to wait.
>>
>> Multiple task priorities is too often used as a sticking plaster
>> to cure livelock/deadlock problems - or more accurately /appear/
>> to cure them.
>>
>> I much prefer to have two priorities: normal and interrupt,
>> and then to have a supervisor of some sort which specifies
>> which task runs at any given time.
>>
>> Any such supervisor is probably application specific, coded
>> explicitly, and can be inspected/debugged/modified like any
>> ordinary task.
>
> A better solution, I think, is that it should not matter which task is running
> at any given time - because these tasks run to handle a specific situation then
> yield waiting for an event.
If there are many tasks whose event(s) have arrived, it becomes necessary to choose which to run. That choice is, of necessity, application dependent. Applications can emphasise forward progress, mean or 95th percentile latency, earliest deadline first, priority, etc.
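
As one concrete example of such a policy, earliest-deadline-first just means "run the ready task whose deadline is soonest". A toy Go sketch, with the task names and deadlines invented for illustration:

    package main

    import (
        "fmt"
        "time"
    )

    type readyTask struct {
        name     string
        deadline time.Time
    }

    // pickEDF returns the index of the ready task whose deadline is soonest.
    func pickEDF(ready []readyTask) int {
        best := 0
        for i, t := range ready {
            if t.deadline.Before(ready[best].deadline) {
                best = i
            }
        }
        return best
    }

    func main() {
        now := time.Now()
        ready := []readyTask{
            {"log flush", now.Add(50 * time.Millisecond)},
            {"motor update", now.Add(2 * time.Millisecond)},
            {"ui refresh", now.Add(20 * time.Millisecond)},
        }
        fmt.Println("run next:", ready[pickEDF(ready)].name) // "motor update"
    }
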
> Multiple priorities are convenient to express which
> tasks should be handled with lower latencies.  I'm not keen on multiple tasks
> that have some kind of time-sharing switching between them, or round-robin
> pre-emptive multitasking.  (Unless by "supervisor" here you just mean a "while
> (true)" loop that calls the tasks one after the other, which is fine.)
In one form or another, yes. But the word "task" has too many different meanings in this context. Many would map a task onto a process, thread or fibre. That frequently leads to "suboptimal behaviour" sooner or later, and it can be a problem sorting out the cause and effect.

To expand on my previous comment below... My preference is that an event is encapsulated in an object when it arrives. The scheduler puts the object in a FIFO, and worker threads (approx one per core) take the object and process the event. Priority can be via multiple queues, and there are other obvious techniques to ensure other properties are met.
>> One I've used in the past is to have a "demultiplexer" which
>> directs job fragments into different FIFOs, for processing
>> by worker threads. That was naturally scalable and can be made
>> "highly available", which is mandatory in telecom applications.
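
A small Go illustration of that shape: events are wrapped in objects and queued, worker goroutines (roughly one per core) take them off, and priority is handled with two queues that workers always try to drain high-first. The event type and the jobs are invented for the example.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    type event struct{ desc string }

    func worker(id int, high, low <-chan event, processed *sync.WaitGroup) {
        for {
            // Prefer the high-priority queue when something is waiting there.
            select {
            case ev := <-high:
                fmt.Printf("worker %d high: %s\n", id, ev.desc)
                processed.Done()
                continue
            default:
            }
            // Otherwise block until any event arrives.
            select {
            case ev := <-high:
                fmt.Printf("worker %d high: %s\n", id, ev.desc)
            case ev := <-low:
                fmt.Printf("worker %d low: %s\n", id, ev.desc)
            }
            processed.Done()
        }
    }

    func main() {
        high := make(chan event, 64)
        low := make(chan event, 64)
        var processed sync.WaitGroup

        // Roughly one worker per core.
        for i := 0; i < runtime.NumCPU(); i++ {
            go worker(i, high, low, &processed)
        }

        processed.Add(6)
        for i := 0; i < 5; i++ {
            low <- event{desc: fmt.Sprintf("routine job %d", i)}
        }
        high <- event{desc: "urgent job"}

        // The workers would normally run forever; here we just wait until
        // the six queued events have been handled.
        processed.Wait()
    }
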