Integrated TFT controller in PIC MCUs

Reply by Dimiter_Popoff ● January 13, 2015

On 13.1.2015 10:53, David Brown wrote:
> On 13/01/15 03:12, Dimiter_Popoff wrote:
>
>> So eventually - a few tons of ink later - you also accept that it is not
>> necessary to save all 32 registers of a 32 register core in an IRQ
>> handler thus there is no advantage whatsoever in having only 16
>> registers - which was the whole point of the discussion.
>> Well better late than never :-).
>
> If I can summarise the arguments here, everyone accepts that you don't
> have to save more registers than you need,

Thank God, the first two tons of ink seem to have worked eventually.
You claimed exactly the opposite for a long time.

> ... and (barring unusual cases)
> you only have to save /all/ registers during a task context switch.  But
> it is common to have to save all "volatile" registers, of which there
> are more in PPC and MIPS than ARM - when you have more registers in the
> cpu, you /will/ do more unnecessary register saves and restores.

This is where the next few tons of ink will have to go apparently.
What on Earth makes you think having 32 registers rather than 15
makes you have more volatile registers?
Starting to spend the third ton of ink: you only have to save the
registers which you use. There is nothing stopping you from using
only 3-4 registers in an interrupt handler thus saving only 3-4
registers on either machine. If the third ton of ink does not
make that clear for you please recycle back to the first 2 tons, let
us be environmentally friendly.

>> Why 32 registers are a must on a load/store machine with a reasonably
>> deep pipeline I already explained; thus my point that ARM with its
>> 15 GPR-s is a crippled load/store architecture stays valid.
>
> This is, I think the more interesting point, which I do not believe has
> been covered properly.  It is clear for any given function, having more
> registers will not give slower code than having fewer registers, all other
> things being equal.  But will more registers give /significantly/ faster
> code?  If so, under what circumstances is that the case?

I already explained that - when you have data dependencies. The FIR
implementation is a classic example of that. Everything else being
equal if you have only 15 registers the 6-stage pipeline will stall
about 2/3 of the time, check the former ton of ink we spilled.
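
A minimal sketch of the data-dependency argument, not a model of any real core. It assumes a hypothetical single-issue, in-order pipeline in which a multiply-accumulate cannot issue until the previous write to the same accumulator has completed. The names `fir_cycles`, `latency` and `accumulators` are illustrative only; loads, the final summation of partial sums and register renaming are all ignored.

```python
def fir_cycles(taps: int, latency: int, accumulators: int) -> int:
    """Cycles until the last multiply-accumulate (MAC) result is available.

    Assumed model: at most one MAC issues per cycle, and a MAC that updates
    an accumulator must wait until the previous MAC writing that same
    accumulator has finished, `latency` cycles after it issued.
    """
    ready = [0] * accumulators  # cycle at which each accumulator is usable again
    cycle = 0                   # earliest cycle at which the next MAC may issue
    for i in range(taps):
        acc = i % accumulators          # round-robin over independent partial sums
        issue = max(cycle, ready[acc])  # stall while the dependency is pending
        ready[acc] = issue + latency
        cycle = issue + 1
    return max(ready)


# One dependent chain versus three independent partial sums, 3-cycle MAC.
print(fir_cycles(taps=12, latency=3, accumulators=1))  # 36
print(fir_cycles(taps=12, latency=3, accumulators=3))  # 14
```

Each extra accumulator, together with the sample and coefficient values in flight for it, occupies registers, which is where the register count enters the picture. How many are enough depends on the latency and on everything else the loop keeps live.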

>... Deep pipeline superscalar processors invariably
> have register renaming, which obsoletes the need for many visible
> registers.

They do have that and it does not obsolete the need in question.
It saves you from unnecessary serializations, yes, but it does not
help against data dependencies - which is what makes 15 registers
too few for a load/store machine (unless it is non-pipelined,
which I am sure is how ARM was, at least initially, but
this is even more crippling).
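
A toy sketch of what renaming does and does not remove, assuming a hypothetical instruction format of (destination, sources) and physical register names `p0`, `p1`, ... invented for illustration. Every write gets a fresh physical register, so reuse of an architectural name no longer forces an ordering, but each read still points at the latest writer, so read-after-write chains survive.

```python
from itertools import count


def rename(program):
    """Map architectural registers to fresh physical registers.

    `program` is a list of (dest, sources) tuples using architectural names.
    Returns the same instructions expressed in physical register names.
    """
    fresh = count()
    current = {}  # architectural name -> physical register holding its value

    def lookup(reg):
        if reg not in current:  # live-in value: give it a physical home
            current[reg] = f"p{next(fresh)}"
        return current[reg]

    renamed = []
    for dest, sources in program:
        phys_sources = tuple(lookup(s) for s in sources)
        current[dest] = f"p{next(fresh)}"  # every write gets a new register
        renamed.append((current[dest], phys_sources))
    return renamed


program = [
    ("r1", ("r2", "r3")),  # r1 = r2 * r3
    ("r4", ("r4", "r1")),  # r4 = r4 + r1  (needs the r1 just produced)
    ("r1", ("r5", "r6")),  # r1 = r5 * r6  (reuses the name r1)
    ("r4", ("r4", "r1")),  # r4 = r4 + r1  (needs the new r1 and the old r4)
]
for instruction in rename(program):
    print(instruction)
```

In the renamed form the two products land in different physical registers (`p2` and `p7`), so the second multiply no longer has to wait for the first add to read `r1`. The accumulator chain (`p3` to `p4` to `p8`) is still strictly sequential.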

> Thus I think you are a very long way from being justified in claiming
> that the 16 registers in ARM make the architecture "crippled".  There is
> certainly code for which 32 registers works better than 16 even when you
> have renaming, especially on larger processors, because you want to
> refer to more data at a time without having to reference the stack or
> other memory data.  But that doesn't make the 16-register ARM "crippled".

So eventually you do understand that having 32 registers makes
the (load/store) machine more efficient by definition.
My FIR example demonstrated this can be up to a few *times* more
efficient.
And yet you call an architecture which is crippled by
design - unable to keep up with the one it is compared to, simply
because of how it has been designed - non-crippled.
Well your choice of words does not alter the reality - which
is that you just cannot design in 15 registers the equivalent
of a load/store machine with 32 registers. You can build hardware
around that, Intel has done that for ages to keep their even more
crippled x86 model alive, but you can build hardware to
do about anything (we covered that, too, so hopefully we will
not go there again).

Clearly ARM was initially designed to save on design
resources - time, designer skill - to have something
working to sell. Performance-wise its architecture is
dramatically inferior to Power exactly because they
made it with only 16 registers, perhaps targeting it at
small, low power applications.  It has been superior to
Power for the smallest of applications obviously (like in
the first phones) but when it comes to performance
it is what it is.
Notice "crippled" does not mean unusable; it only means that
under equal conditions using ARM rather than Power (for large
enough systems, we covered that already, say 1M+ RAM) ARM
will be at a significant disadvantage, up to a few times
slower. Of course certain tasks can be done by the crippled
CPU no slower, it is just that the opposite is never the case.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/

Reply by ● January 13, 2015

Dimiter_Popoff <dp@tgi-sci.com> wrote:
> On 13.1.2015 09:33, Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

>> The CPU reserves space in the stack frame for the caller-saved FPU
>>                                                   ^^^^^^^^^^^^^^^^
>> registers
>> ^^^^^^^^^
> This for all FPU registers?

No. The point of the Cortex-M automatic interrupt prologue is to allow 
ISRs to be normal C functions without any assembly glue.


> (probably you can switch it off?)

It is optional.

-a

Reply by Paul ● January 13, 2015

In article <m92md7$j6q$1@dont-email.me>, david.brown@hesbynett.no 
says...
.....
 
> For interrupts, function calls and context switches, it seems clear that
> 32 registers involves more saves and restores than 16 registers, but
> there is no convincing argument that this difference is relevant.  This
> round is a tie in the great 16-reg vs. 32-reg battle, and we should move
> on to more interesting points.

The whole 32 vs. 16 register 'debate' has been a "how many
fairies fit on a pinhead" type of discussion, all based on the
types of applications individual posters normally write.

We have no idea what type of application or even range of applications 
the processor is for, let alone what type of processing is required.

Personally, I think observations on the following might have been more useful:

1/ Package type options for the processor, and the merits of compiler and
   other support

2/ Merits of the TFT controller's flexibility

3/ What type of things he will do with the TFT, and whether the UI has to
   have animated or moving widgets, or even phone style windowing and wipe
   effects

4/ Does the TFT controller have its own frame buffer(s), and what are
   their limits

5/ Does it have hardware assist, or rely on memory to memory DMA, for
   copying screens or bits of screens

6/ Graphical library support and limitations (there was a bit early on)

If the application is going to be busy doing lots of memory moves
while the TFT controller accesses shared memory, that is going to put a
bigger load on the application than most other things, in the
MAJORITY of applications.

-- 
Paul Carpenter          | paul@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/>    PC Services
<http://www.pcserviceselectronics.co.uk/pi/>  Raspberry Pi Add-ons
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.badweb.org.uk/> For those web sites you hate

Reply by Wouter van Ooijen ● January 13, 2015

David Brown schreef op 13-Jan-15 om 9:53 AM:
> On 13/01/15 03:12, Dimiter_Popoff wrote:
>
>> So eventually - a few tons of ink later - you also accept that it is not
>> necessary to save all 32 registers of a 32 register core in an IRQ
>> handler thus there is no advantage whatsoever in having only 16
>> registers - which was the whole point of the discussion.
>> Well better late than never :-).
>
> If I can summarise the arguments here, everyone accepts that you don't
> have to save more registers than you need, and (barring unusual cases)
> you only have to save /all/ registers during a task context switch.  But
> it is common to have to save all "volatile" registers, of which there
> are more in PPC and MIPS than ARM - when you have more registers in the
> cpu, you /will/ do more unnecessary register saves and restores.
> Opinions differ wildly on the significance or importance of this.
>
> For interrupts, function calls and context switches, it seems clear that
> 32 registers involves more saves and restores than 16 registers, but
> there is no convincing argument that this difference is relevant.  This
> round is a tie in the great 16-reg vs. 32-reg battle, and we should move
> on to more interesting points.
>
>
>> Why 32 registers are a must on a load/store machine with a reasonably
>> deep pipeline I already explained; thus my point that ARM with its
>> 15 GPR-s is a crippled load/store architecture stays valid.
>
> This is, I think the more interesting point, which I do not believe has
> been covered properly.  It is clear for any given function, having more
> registers will not give slower code than having fewer registers, all other
> things being equal.

I disagree. Registers are not free: they cost die space, power, and
probably most important: bits in the opcode. Other things being equal
(and the instruction bandwidth being a limit) more registers means fewer
bits for other things, with the potential for slower code.
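
A short worked example of the opcode-bits point, assuming a hypothetical fixed 32-bit instruction word with three register operands. Real encodings differ and often mix formats; `operand_field_bits` is an illustrative name.

```python
def operand_field_bits(registers: int, operands: int = 3) -> int:
    """Bits needed to name `operands` registers out of `registers`."""
    bits_per_register = (registers - 1).bit_length()  # ceil(log2(registers))
    return operands * bits_per_register


for regs in (16, 32):
    used = operand_field_bits(regs)
    print(f"{regs} registers: {used} operand bits, {32 - used} left in a 32-bit word")
```

Under these assumptions, going from 16 to 32 registers costs three more bits per three-operand instruction, bits that could otherwise go to immediates, condition fields or extra opcodes.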

> But will more registers give /significantly/ faster
> code?  If so, under what circumstances is that the case?  And how does
> it compare to using the same hardware space and/or opcode instruction
> space for other features?
>
> When you have a deep pipeline and superscalar execution (which is not
> the case for most microcontroller cpus), you have to have a lot of data
> passing through the core to make full use of it, and lots of data "in
> flight" at a time.  And since data has to pass through registers, that
> means lots of registers.  But does that mean needing lots of /visible/
> registers in the ISA?  Deep pipeline superscalar processors invariably
> have register renaming, which obsoletes the need for many visible
> registers.
>
> Without register renaming, you need to "manually" (i.e., either the
> assembly programmer or the compiler, rather than the cpu itself) assign
> registers in order to schedule and interleave reading new data in, doing
> calculations, and writing out the results to maximise the throughput -
> your aim is to avoid the key execution units having to wait for incoming
> data.  But with register renaming, you can use the same register names
> all the way - the cpu handles the renaming and scheduling.  The result
> is that the code is smaller, simpler, clearer, and more efficient for
> caching (especially if the cpu has a super-fast cache for small loops).
>
>
> So if you have a PPC core such as the e200z7, with a 10-stage pipeline
> and dual issue execution unit, but no register renaming, you need more
> than 16 named registers to keep the execution units busy in hard
> calculations.  But on a small ARM (Cortex-M3/M4) with a single-issue cpu
> and a three stage pipeline, 16 registers is sufficient.  And on a large
> ARM (Cortex-A) with a multiple issue, deep pipeline core, 16 /named/
> registers is /still/ sufficient because there are a large number of
> /unnamed/ registers for remapping.
>
>
> Thus I think you are a very long way from being justified in claiming
> that the 16 registers in ARM make the architecture "crippled".  There is
> certainly code for which 32 registers works better than 16 even when you
> have renaming, especially on larger processors, because you want to
> refer to more data at a time without having to reference the stack or
> other memory data.  But that doesn't make the 16-register ARM "crippled".
>
>>
>> I never wanted to go into deeper detail on what this or that
>> particular core does right or wrong, the whole point was the basic
>> 32 vs. 16 (15 on ARM really) GPR-s.
>>
>> No, there is no manual available for VPA at the moment as there are
>> no machines on the market running DPS other than our spectrometry
>> devices. Once I decide to make DPS, VPA and the whole thing separately
>> marketable to compete with MS, linux and the like I will announce it
>> loudly enough I suppose.
>>
>> Dimiter
>>
>> ------------------------------------------------------
>> Dimiter Popoff, TGI             http://www.tgi-sci.com
>> ------------------------------------------------------
>> http://www.flickr.com/photos/didi_tgi/
>>
>

Reply by Simon Clubley ● January 13, 2015

On 2015-01-13, Dimiter_Popoff <dp@tgi-sci.com> wrote:
> On 13.1.2015 10:53, David Brown wrote:
>> ... and (barring unusual cases)
>> you only have to save /all/ registers during a task context switch.  But
>> it is common to have to save all "volatile" registers, of which there
>> are more in PPC and MIPS than ARM - when you have more registers in the
>> cpu, you /will/ do more unnecessary register saves and restores.
>
> This is where the next few tons of ink will have to go apparently.
> What on Earth makes you think having 32 registers rather than 15
> makes you have more volatile registers?

Because it depends on the ABI in use.

> Starting to spend the third ton of ink: you only have to save the
> registers which you use. There is nothing stopping you from using
> only 3-4 registers in an interrupt handler thus saving only 3-4
> registers on either machine. If the third ton of ink does not
> make that clear for you please recycle back to the first 2 tons, let
> us be environmentally friendly.
>

If you use an ABI in which most of the 32 registers are callee saved
or write your device specific handler in assembly language and hence
have direct control over the registers in use, then you are correct.

If you use a higher level language to write your handler and the ABI
in use states around half of those registers are caller saved, then,
in the general case, your IRQ wrapper must save those registers before
it calls that handler because the compiler will generate code which
conforms to that ABI.

These days, most people write their drivers in a higher level language
such as C and code from different people/teams has to work together
so the compiler must conform to the ABI in use.

This means that, in the general case, if your ABI requires the caller
to save (say) ~16 registers out of the 32 registers but the code
generated by the compiler for a specific driver only uses 6 of the
caller saved registers, then those ~16 registers still need to be
saved because the wrapper doesn't know any different.
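
The case described above can be sketched as a small model. The ABI split (r0-r15 caller-saved on a 32-register machine), the register names and `wrapper_saves` are hypothetical, chosen only to match the "~16 versus 6" figures in the text.

```python
def wrapper_saves(caller_saved, handler_uses, handler_is_opaque):
    """Registers the IRQ entry code must save before running the handler."""
    if handler_is_opaque:
        # Compiled handler: the ABI lets it clobber any caller-saved register,
        # so all of them are saved. Callee-saved registers it touches are
        # preserved by the handler's own prologue and epilogue.
        return set(caller_saved)
    # Hand-written handler whose register use is known exactly.
    return set(handler_uses)


# Hypothetical 32-register ABI: r0-r15 caller-saved, r16-r31 callee-saved.
caller_saved = {f"r{i}" for i in range(16)}
handler_uses = {f"r{i}" for i in range(6)}

print(len(wrapper_saves(caller_saved, handler_uses, handler_is_opaque=True)))   # 16
print(len(wrapper_saves(caller_saved, handler_uses, handler_is_opaque=False)))  # 6
```

The wrapper's cost is set by the ABI's caller-saved set rather than by the total register count, which is why the same core can look cheap or expensive depending on the ABI chosen.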

The upside is that you get a general purpose ABI in which everyone's
higher level language code can work together.

Simon.

-- 
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world

Reply by Dombo ● January 13, 2015

Op 13-Jan-15 15:31, Wouter van Ooijen schreef:
> David Brown schreef op 13-Jan-15 om 9:53 AM:
>> On 13/01/15 03:12, Dimiter_Popoff wrote:
>>
>>> So eventually - a few tons of ink later - you also accept that it is not
>>> necessary to save all 32 registers of a 32 register core in an IRQ
>>> handler thus there is no advantage whatsoever in having only 16
>>> registers - which was the whole point of the discussion.
>>> Well better late than never :-).
>>
>> If I can summarise the arguments here, everyone accepts that you don't
>> have to save more registers than you need, and (barring unusual cases)
>> you only have to save /all/ registers during a task context switch.  But
>> it is common to have to save all "volatile" registers, of which there
>> are more in PPC and MIPS than ARM - when you have more registers in the
>> cpu, you /will/ do more unnecessary register saves and restores.
>> Opinions differ wildly on the significance or importance of this.
>>
>> For interrupts, function calls and context switches, it seems clear that
>> 32 registers involves more saves and restores than 16 registers, but
>> there is no convincing argument that this difference is relevant.  This
>> round is a tie in the great 16-reg vs. 32-reg battle, and we should move
>> on to more interesting points.
>>
>>> Why 32 registers are a must on a load/store machine with a reasonably
>>> deep pipeline I already explained; thus my point that ARM with its
>>> 15 GPR-s is a crippled load/store architecture stays valid.
>>
>> This is, I think the more interesting point, which I do not believe has
>> been covered properly.  It is clear for any given function, having more
>> registers will not give slower code than having fewer registers, all other
>> things being equal.
>
> I disagree. Registers are not free: they cost die space, power, and
> probably most important: bits in the opcode. Other things being equal
> (and the instruction bandwidth being a limit) more registers means fewer
> bits for other things, with the potential for slower code.

Like almost everything in engineering it is a trade-off. The number of
registers needed to accomplish a task efficiently also depends on other
aspects of the ISA. For example, with an ISA with more sophisticated
addressing modes one may need fewer registers than with a minimalistic
RISC ISA. Many modern (superscalar) processors have more registers
internally than are exposed via the ISA; the register renaming technique
reduces the chance that registers become a performance bottleneck. With
the x86 64-bit instruction set its designers chose to expand the number of
general purpose registers from 8 to 16. They could have easily chosen a
larger number of registers but apparently their analysis showed that the
benefit of more registers did not outweigh the downsides. I'd say that
it is a bit too simplistic to state that an ISA that has only 15 GP
registers must be crippled.

I think this discussion about the optimum number of registers would be
more appropriate in comp.arch, where the people who are or were involved
with processor design are.

Reply by Dimiter_Popoff ● January 13, 2015

On 13.1.2015 22:00, Simon Clubley wrote:
> On 2015-01-13, Dimiter_Popoff <dp@tgi-sci.com> wrote:
>> On 13.1.2015 10:53, David Brown wrote:
>>> ... and (barring unusual cases)
>>> you only have to save /all/ registers during a task context switch.  But
>>> it is common to have to save all "volatile" registers, of which there
>>> are more in PPC and MIPS than ARM - when you have more registers in the
>>> cpu, you /will/ do more unnecessary register saves and restores.
>>
>> This is where the next few tons of ink will have to go apparently.
>> What on Earth makes you think having 32 registers rather than 15
>> makes you have more volatile registers?
>
> Because it depends on the ABI in use.

This is at least the third time I have explained this to you but I don't
mind, I'll do it as many times as it takes: there are many ways
to destroy something working other than inept programming, some
of them much easier.

So what is the guaranteed IRQ latency on your ARM core of choice
running Linux with some SATA drives, multiple windows, ethernet,
some serial interfaces? Try to give some figure - please notice
the word "guaranteed", I know how much the Linux crowd prefers
to talk "in general".

Dimiter


Reply by David Brown ● January 14, 2015

On 13/01/15 21:57, Dombo wrote:
> Op 13-Jan-15 15:31, Wouter van Ooijen schreef:
>> David Brown schreef op 13-Jan-15 om 9:53 AM:
>>> On 13/01/15 03:12, Dimiter_Popoff wrote:
>>>
>>>> So eventually - a few tons of ink later - you also accept that it is
>>>> not
>>>> necessary to save all 32 registers of a 32 register core in an IRQ
>>>> handler thus there is no advantage whatsoever in having only 16
>>>> registers - which was the whole point of the discussion.
>>>> Well better late than never :-).
>>>
>>> If I can summarise the arguments here, everyone accepts that you don't
>>> have to save more registers than you need, and (barring unusual cases)
>>> you only have to save /all/ registers during a task context switch.  But
>>> it is common to have to save all "volatile" registers, of which there
>>> are more in PPC and MIPS than ARM - when you have more registers in the
>>> cpu, you /will/ do more unnecessary register saves and restores.
>>> Opinions differ wildly on the significance or importance of this.
>>>
>>> For interrupts, function calls and context switches, it seems clear that
>>> 32 registers involves more saves and restores than 16 registers, but
>>> there is no convincing argument that this difference is relevant.  This
>>> round is a tie in the great 16-reg vs. 32-reg battle, and we should move
>>> on to more interesting points.
>>>
>>>> Why 32 registers are a must on a load/store machine with a reasonably
>>>> deep pipeline I already explained; thus my point that ARM with its
>>>> 15 GPR-s is a crippled load/store architecture stays valid.
>>>
>>> This is, I think the more interesting point, which I do not believe has
>>> been covered properly.  It is clear for any given function, having more
>>> registers will not give slower code than having fewer registers, all other
>>> things being equal.
>>
>> I disagree. Registers are not free: they cost die space, power, and
>> probably most important: bits in the opcode. Other things being equal
>> (and the instruction bandwidth being a limit) more registers means fewer
>> bits for other things, with the potential for slower code.
> 
> Like almost everything in engineering it is a trade-off. The number of
> registers needed to accomplish a task efficiently also depends on other
> aspects of the ISA. For example, with an ISA with more sophisticated
> addressing modes one may need fewer registers than with a minimalistic
> RISC ISA. Many modern (superscalar) processors have more registers
> internally than are exposed via the ISA; the register renaming technique
> reduces the chance that registers become a performance bottleneck. With
> the x86 64-bit instruction set its designers chose to expand the number of
> general purpose registers from 8 to 16. They could have easily chosen a
> larger number of registers but apparently their analysis showed that the
> benefit of more registers did not outweigh the downsides. I'd say that
> it is a bit too simplistic to state that an ISA that has only 15 GP
> registers must be crippled.

It's useful to make the distinction between /named/ registers (exposed
in the ISA to the programmer) and /unnamed/ registers (implementation
dependent, internal registers for register renaming).  When designing
the amd64 ISA, the AMD folks, working tightly with gcc developers, Linux
kernel developers, and presumably many other people, concluded that 16
named GP registers was the right balance for the architecture.  It was
long established that the 8 registers of x86 were too few, but as you say
their analysis did not show much benefit of more than 16 registers - and
the disadvantages (opcode space, and extra register stores in function
calls) outweighed any advantage.

Internally, implementations of amd64 might have hundreds of unnamed GP
registers.

Also note that the amd64 architecture has lots of SIMD registers as well
as GP registers.  I think in most examples where large numbers of GP
registers would help, SIMD registers are a better solution - and are
therefore implemented on most fast cpu designs.

Finally, the discussion was centred on load-store architectures such as
ARM, MIPS and PPC.  x86/amd64 are not load-store, and can do more with
fewer named registers.  Dimiter's assertion was that a load-store
architecture is inherently crippled if it has only 16 registers - he has
not commented on CISC architectures.

A more relevant example is the 64-bit ARM architecture - which has 32 GP
registers.  That does not in any way prove that the old 32-bit ARM was
"crippled" with only 16 registers - but it does show that for such a
large processor, the extra registers give a positive trade-off.

> 
> I think this discussion about the optimum number of registers would be
> more appropriate in comp.arch, where the people who are or were involved
> with processor design are.
>

Reply by Tom Gardner ● January 14, 2015

On 14/01/15 02:11, Dimiter_Popoff wrote:
> So what is the guaranteed IRQ latency on your ARM core of choice
> running Linux with some SATA drives, multiple windows, ethernet,
> some serial interfaces? Try to give some figure - please notice
> the word "guaranteed", I know how much the Linux crowd prefers
> to talk "in general".

Having L1/L2/L3 caches will instantly introduce a high variation
between the mean and max latencies. Even for i486s with their
minimal cache and no operating system, a 10:1 variability was
visible.

Any variability to do with register saving will be completely
insignificant compared to the effects of caches. Unless, of
course, you are having to dump the entire hidden state of
an Itanic processor :)

Reply by Dimiter_Popoff ● January 14, 2015

On 14.1.2015 13:42, Tom Gardner wrote:
> On 14/01/15 02:11, Dimiter_Popoff wrote:
>> So what is the guaranteed IRQ latency on your ARM core of choice
>> running Linux with some SATA drives, multiple windows, ethernet,
>> some serial interfaces? Try to give some figure - please notice
>> the word "guaranteed", I know how much the Linux crowd prefers
>> to talk "in general".
>
> Having L1/L2/L3 caches will instantly introduce a high variation
> between the mean and max latencies. Even for i486s with their
> minimal cache and no operating system, a 10:1 variability was
> visible.

Yes, though on some processors one has the ability to lock part of the
L1 cache - which allows it to be dedicated to interrupts, which can
make things a lot tighter (by avoiding the need to update entire
cachelines).

Overall the latency variability obviously increases as processor
sizes increase but then total execution times decrease, memories
get faster etc.  so the worst case latency can still be very low.
On the 5200b which I use I have never needed to resort to any
cache locks etc., all I do is just stay masked only as long as
absolutely necessary.

> Any variability to do with register saving will be completely
> insignificant compared to the effects of caches. Unless, of
> course, you are having to dump the entire hidden state of
> an Itanic processor :)
>

Well we have not come to that obvious point yet I am afraid :-).
Let us first have the figure on the worst-case Linux IRQ latency
I asked for, then put into context the attempt by ARM/Linux
devotees to claim lower latency from not having enough registers :-).

Dimiter
