
Integrated TFT controller in PIC MCUs

Started by pozz January 7, 2015
glen herrmannsfeldt wrote on 10-Jan-15 at 3:25 AM:
> Simon Clubley <clubley@remove_me.eisner.decus.org-earth.ufp> wrote:
>> On 2015-01-09, Dimiter_Popoff <dp@tgi-sci.com> wrote:
>
>>> Well I really cannot simplify the concept of saving say 4 out of 32
>>> registers, using only them in an IRQ handler, then restoring only
>>> them and returning from the exception.
>
>> In the general case, you have to push and pop all the registers every
>> time you take an interrupt.
>
> Some processors have multiple register sets that might avoid that.
> SPARC has register windows, such that they don't have to save to
> memory until all the windows are in use. I don't know if that is
> for interrupts, too.
That is a nice concept, but it has two problems:

- when you overflow the available register sets, you must spill to memory. Whether you need to do this depends on where you are in the register set. This makes timing difficult to predict, which is not nice for a real-time system.

- imagine a context switch: now you have to save/restore all register sets!

In general, fat-context CPUs are better at single-threaded no-interrupt applications, but worse for switch-often interrupt-heavy applications.

Wouter van Ooijen
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote on 10-Jan-15 at 10:51 AM:
> Vladimir Ivanov <none@none.tld> wrote:
>
>> Do all instruction forms have wider version to accommodate R8-R15 usage?
>
> I believe that is the case.
Not in Cortex-M0. About the only thing you can do with the upper registers (r8-12) is copy to/from a lower register.

As an illustration: a cooperative context switch on a cortex-m0:

        .cpu cortex-m0
        .global switch_from_to
        .text
        .align 2

// extern "C" void switch_from_to(
//    int ** current_stack_pointer,
//    int * next_stack_pointer
// );

switch_from_to:

   // save current context on the stack
   push { r4 - r7, lr }
   mov r2, r8
   mov r3, r9
   mov r4, r10
   mov r5, r11
   mov r6, r12
   push { r2 - r6 }

   // *current_stack_pointer = sp
   mov r2, sp
   str r2, [ r0 ]

   // sp = next_stack_pointer
   mov sp, r1

   // restore the new context from the new stack
   pop { r2 - r6 }
   mov r12, r6
   mov r11, r5
   mov r10, r4
   mov r9, r3
   mov r8, r2
   pop { r4 - r7, pc }

Wouter van Ooijen
Wouter van Ooijen <wouter@voti.nl> wrote:
> Anders.Montonen@kapsi.spam.stop.fi.invalid wrote on 10-Jan-15 at 10:51 AM:
>> Vladimir Ivanov <none@none.tld> wrote:
>>
>>> Do all instruction forms have wider version to accommodate R8-R15 usage?
>> I believe that is the case.
> Not in Cortex-M0.
That is true, but they implement ARMv6-M Thumb, not Thumb-2. -a
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote on 10-Jan-15 at 11:45 AM:
> Wouter van Ooijen <wouter@voti.nl> wrote:
>> Anders.Montonen@kapsi.spam.stop.fi.invalid wrote on 10-Jan-15 at 10:51 AM:
>>> Vladimir Ivanov <none@none.tld> wrote:
>>>
>>>> Do all instruction forms have wider version to accommodate R8-R15 usage?
>>> I believe that is the case.
>> Not in Cortex-M0.
>
> That is true, but they implement ARMv6-M Thumb, not Thumb-2.
In that case I was confused about the context of the statement :) Wouter
On 09/01/15 23:04, Dimiter_Popoff wrote:
> On 09.1.2015 г. 11:47, David Brown wrote:
>> On 09/01/15 00:22, Dimiter_Popoff wrote:
>>> On 09.1.2015 г. 00:53, Wouter van Ooijen wrote:
>>>> Dimiter_Popoff wrote on 08-Jan-15 at 11:18 PM:
>>>>> On 08.1.2015 г. 23:25, David Brown wrote:
>>>>>> ... (Just as "cpus should have more than 16 core registers" is a
>>>>>> good reason for disliking ARM's, if that is your opinion.)
>>>>>
>>>>> Results of arithmetic calculations are not exactly what I would
>>>>> call an opinion.
>>>>> 16 registers - one of which being reserved for the PC - are too few
>>>>> for a load/store machine.
>>>>
>>>> 16 is a fact, but the rest is nothing but opinion.
>>>>
>>>>> Clearly it will work but under equal conditions
>>>>> will be slower than if it had 32 registers, sometimes much slower.
>>>>
>>>> More registers can be slower too.
>>>
>>> Yes, 32 is about the optimum I suppose. But I have not really analyzed
>>> that, what I have - and demonstrated by an example which you chose to
>>> ignore - is the comparison 32 vs. 16.
>>
>> Certainly it is possible to pick examples where 32 registers is more
>> effective than 16 - but equally we can pick examples where 16 registers
>> is more efficient (such as context switching, or code with a lot of
>> small functions). Examples are illustrative, but not proof of a general
>> rule.
>
> You have yet to prove this point. Context switching is not a valid
> example, as I explained in my former post which you must have read
> prior to replying to (for those who have not, context switching is
> responsible for a fraction of a percent of CPU time, consequently
> halving or even completely eliminating that can bring a fraction of
> a percent improvement, i.e. it is negligible).
I think it would be wrong to talk about /proving/ points here - proper proof would require implementing the same algorithms on different architectures (or preferably on the same basic architecture, but with different register counts) and comparing code densities, run times, cache hit/miss counts, memory bandwidth, etc. That is clearly far beyond what anyone will do for a Usenet post!

Context switching is not a valid example for /you/, based on the figures /you/ gave. But it is certainly not something that can be dismissed so easily. I have written systems that had timer interrupts 100,000 times per second - thus interrupt overhead is 100 times as important as in your single example with 1000 interrupts per second. I have written one system where there were 40 clock cycles between interrupts - that does not leave a lot of time for context saving and restoring.

In general, in bigger systems (and PowerPC cores tend to be used in bigger systems than most of the embedded cores we see here) you try to avoid many interrupts, and prefer DMA and more sophisticated peripherals to keep the interrupt rate low. In smaller microcontrollers, you don't have such sophistication - your UART might have only a single buffer, so you will have interrupts for every character transmitted or received. But interrupts are less of an overhead, partly because of the lower register count. It's a different balance.

Other than interrupt context switches, there is also function call overhead. I don't know what sort of calling conventions you use in your VPA, but for high-level languages there is normally a defined convention with some registers being caller-saved (or "volatile") and others being callee-saved (or "non-volatile"). The split can vary between compilers on a target, or can be defined by the target's standard ABI. When you are calling unknown code (i.e., code from a separately compiled module), the compiler (or assembly programmer) must follow the calling convention. That means a function must save any callee-saved registers before using them, in case the calling function had data there. And it must save caller-saved registers before function calls, in case the called function uses them. There is always a significant amount of unnecessary saving and restoring in this process, and it increases with the number of registers in the system and with the number of small and simple functions (because with larger functions that really use the data, the save/restores are no longer unnecessary and therefore not overhead).

So if a PPC function (following the standard PPC EABI conventions) needs to use any of the 18 "non-volatile" registers, it must save the old values and restore them on exit - even if the calling function does not need the old values. And if it is calling another function, then it must save any of the 11 "volatile" registers it wants to keep - even if the called function does not touch them.
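For illustration, a minimal C sketch of that caller-saved/callee-saved split (not from the thread; extern_work() is a made-up stand-in for separately compiled code whose body the compiler cannot see):

/* A rough sketch of the calling-convention overhead described above. */

extern int extern_work(int x);

int small_wrapper(int a, int b)
{
    int t = a * 3 + b;      /* t is live across the call below */

    /* Because extern_work() may clobber every caller-saved ("volatile")
       register, the compiler must either keep t in a callee-saved
       ("non-volatile") register - paying for a save/restore of that
       register in this function's prologue/epilogue - or spill t to the
       stack around the call, even if extern_work() never touches it. */
    int r = extern_work(t);

    return r + t;
}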
> I would certainly be interested in the example you claim to be able
> to produce demonstrating how 16 registers can be more effective.
> My example covers a typical, widespread application - that of a FIR.
> Let us see yours.
I can't produce a /function/ that is more efficient with 16 registers than 32 registers - but I hope that above I have explained how it makes a difference with chains of small functions (or functions with few register demands).

I think it is reasonable to say that as programs (in the embedded world) have got bigger, memories have got bigger, and compilers have got better, then the balance for many systems has moved more towards 32 registers rather than 16 registers, in the same way that it has moved towards 32-bit cpus from 8 and 16-bit cpus.
>>>> More registers => more bits in an instruction to specify a register =>
>>>> more to read from code memory => slower
>>>
>>> Not really. Load/store machines typically have a fixed 32 bit
>>> instruction word; using that for only 16 registers is simply
>>> waste of space, you will have less information packed in the
>>> opcode thus you will need more opcodes to do the same job, thus
>>> having to do *more* memory fetches.
>>
>> Many load/store machines have some sort of compressed or limited
>> instruction format for greater efficiency (especially of instruction
>> cache). ARM has a couple of "thumb" modes, MIPS have an equivalent, and
>> on the PPC I have seen various schemes.
>
> I'd be interested to see those sub-32 bit opcode schemes you talk about
> on power, I have yet to encounter one of them. Certainly none of them
> is present on the cores I use or have investigated. (I am just curious,
> not in need of something like that).
The one I am most familiar with is VLE, which is used by many of Freescale's PPC microcontrollers. (It may be used by others too, but I have only used PPC from Freescale.) I also remember that Freescale had some chips with a sort of compression scheme where code was decompressed while it was loaded into cache - but I have forgotten the details. VLE is the same basic idea as ARM's Thumb2 and the MIPS equivalent.
>> Even in the full 32-bit ARM set, the bits "saved" by having fewer
>> registers are used for the conditional execution bits and the barrel
>> shifter.
>
> Can you please elaborate on that. What can ARM do using the barrel
> shifter which you cannot do using the likes of rlwinm, rlwnm or rlwimi
> on power?
The barrel shifter bits apply to a wide range of ARM instructions - meaning you can effectively tag a zero-cost shift instruction into many other instructions. Thus if you want "a = b + c * 16;", on the PPC you need two instructions (shift then add) with pipeline/scheduling considerations between them, while on the ARM it is done in one instruction and one cycle. With 16-bit Thumb instructions, the barrel shifter only works on loads, stores, and specific rotate/shift instructions - with 32-bit Thumb and full ARM instructions it works on a wide range of instructions.
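As a small illustration of that "a = b + c * 16;" example - the instructions in the comments are what a compiler typically emits for such code, shown for illustration rather than captured from any particular compiler:

unsigned add_scaled(unsigned b, unsigned c)
{
    /* ARM (32-bit ARM or 32-bit Thumb-2): one instruction, the shift is
       folded into the second operand:
           add   r0, r0, r1, lsl #4
       PowerPC: two dependent instructions:
           slwi  r4, r4, 4        (a simplified mnemonic for an rlwinm form)
           add   r3, r3, r4                                                 */
    return b + c * 16u;
}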
>> Every bit of space in the instruction set is important. Using them to
>> support extra registers may be the best overall tradeoff in some cases,
>> but it is always a tradeoff.
>
> This kind of general talk leads nowhere here as you may have found out
> in previous discussions. If you cannot support a claim by a valid
> particular example you are saying nothing.
>
> I explained that 16 registers are too few for a load/store machine,
> gave an example to make it easier to understand why (to be able to
> compensate for the pipeline delay).
>
> You are just repeating your opinion basing it on nothing.
And you in turn are generalising based on an example and your own very specialised experience. I hope that what I wrote earlier makes it clear why I think 16 registers can be an advantage, and why I think your example cannot count as a general proof or argument (though I happily accept it as an example of when 32 registers is very useful).

Beyond that, I would say that even if 16 registers is not /more/ efficient than 32 registers, I have yet to see any reasoning for why 32 registers is /significantly/ more useful for the type of code generally seen in microcontrollers (and yes, I have no choice here but to talk of generalities). A filter algorithm might run faster with more registers - but it would do even better by using additional DSP-specific support (such as the DSP registers and instructions on the Cortex M4 compared to the M3) or SIMD support (AltiVec on PPC, Neon on ARM).

I will try not to repeat myself, but I hope you can see that this is necessarily a generalised discussion, based strongly on opinion and personal experience - unless you want to give several whole programs implemented on at least two architectures using commonly used tools and techniques (i.e., C code rather than VPA code) as real evidence.
>>> If you become familiar with the power architecture instruction set
>>> you will find very little room for performance improvement, the
>>> person who did it knew what he was doing really well.
>>
>> It's a nice architecture (apart from the backwards bit numbering!).
>
> The bit numbering is simply big endian. I also
> have had my trouble with it of course but it is easy to get used to.
Yes, it is fine when you are used to it - but very weird to start with, and different from everything else (including the numbering used on other big-endian processors I have used, such as M68K/ColdFire). You have to double-check everything when you are connecting address line A31 on the MPC to A0 on the external RAM chip! And then Freescale's documentation freely mixes up 64-bit PPC conventions with the 32-bit conventions, so you find you are trying to set "bit 59" of a register by writing the value 16... It was even more "fun" when I found Freescale had also mixed it up in a couple of register definitions in a header file.
> In VPA I do use both - crazy as it may seem, once you get used to
> thinking about it, it is no longer an issue.
> But overall having big endian bit numbering on a big endian machine
> is the correct thing to do. If one chooses to use a power core in
> little endian mode much of the time one will be somewhat screwed
> I suppose.
To me, "bit 0" is always the least significant bit regardless of the endianness of the chip. But I try not to start wars over it :-)
>> ... But
>> it is a very complex architecture - for smaller devices (say 200 MHz,
>> single core - microcontroller class cpus) a PPC core will be much
>> bigger, more difficult to design and work with, and take more power than
>> an ARM (or MIPS) core.
>
> Oh I agree power does not make sense on the smallest of MCUs, of course.
> In fact I don't think it makes much sense below a megabyte of RAM
> or so (but then that's me and is based on my needs so far, I am not
> claiming this to be some general rule).
Fair enough.
>>> Not at all. You can always save/restore even just 1 register if you
>>> want to on a 32 register machine, what you *cannot* do on ARM is
>>> have more than 13 registers to use in an IRQ handler when you need
>>> them - so you will need *more* memory accesses in this case.
>>
>> That makes no sense to me.
>
> Well I really cannot simplify the concept of saving say 4 out of 32
> registers, using only them in an IRQ handler, then restoring only
> them and returning from the exception.
When your interrupt function is a leaf function, it's no problem to only save the registers you need (regardless of the size of the register set). But when it is not a leaf function, and you are calling other functions (whose code you do not know), you have to follow the calling conventions and assume that the called function will destroy all "volatile" registers - that's at least 11 extra register saves in a PPC EABI system, as well as the link register, CCR, etc.
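A small C sketch of that leaf/non-leaf distinction (the handler names are invented; the register list in the comment refers to the PPC EABI convention discussed here):

extern void external_callback(void);   /* separately compiled, unknown code */

volatile unsigned tick_count;

void leaf_handler(void)
{
    /* Leaf: the compiler saves only the few registers it actually uses. */
    tick_count++;
}

void chained_handler(void)
{
    tick_count++;
    /* Non-leaf: external_callback() is allowed to clobber every
       caller-saved ("volatile") register - roughly r0 and r3-r12, plus
       LR, CTR, CR and XER on PPC EABI - so the interrupt entry path has
       to have saved all of them before this call is made. */
    external_callback();
}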
>>> And since the latency which matters is the IRQ latency (tasks get
>>> switched once in milliseconds, IRQ latencies may well have to be
>>> in the few uS range) - the time on save/restore all registers
>>> is negligible (e.g. 32 2.5nS cycles once per mS on a 400 MHz power
>>> core, or 0.008% of the time).
>>
>> The time needed to save and restore all registers may not be relevant in
>> a given application, but it is not irrelevant or negligible in all cases.
>
> So show us one such case.
On one PPC system I worked with which did not support individual interrupt vectors, /all/ registers were saved (and later restored) because the vector code was calling unknown external code. 32 x 32-bit general purpose registers plus 32 x 64-bit floating point registers, stored on external SRAM with 2-cycle accesses, meant about 400 cycles just for the register save and restore - not including the memory bandwidth for reading the code or any other processing time.

Even if I had spent time optimising it by limiting the storage to the volatile registers (since I knew the interrupt functions all followed the EABI conventions correctly), and therefore halved the overhead, it would still have been very large.
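A back-of-the-envelope version of that figure, using the assumptions stated above plus one assumption of my own - a 32-bit path to the external SRAM, so each 64-bit FPR takes two accesses:

enum {
    GPR_ACCESSES      = 32 * 1,  /* 32 x 32-bit general purpose registers   */
    FPR_ACCESSES      = 32 * 2,  /* 32 x 64-bit FPRs, two bus accesses each */
    CYCLES_PER_ACCESS = 2,       /* external SRAM, as stated above          */

    SAVE_CYCLES  = (GPR_ACCESSES + FPR_ACCESSES) * CYCLES_PER_ACCESS, /* 192 */
    TOTAL_CYCLES = 2 * SAVE_CYCLES   /* 384 - roughly the "400 cycles" quoted */
};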
On 09/01/15 16:30, Vladimir Ivanov wrote:
> On Fri, 9 Jan 2015, David Brown wrote:
>
>> On 09/01/15 10:54, Vladimir Ivanov wrote:
>>>
>>> On Fri, 9 Jan 2015, David Brown wrote:
>>>
>>>> For microcontrollers, such as the Cortex M devices, I think 16
>>>> registers is a good balance for a lot of typical code.
>>>
>>> In Thumb2 you work directly with 8 GP registers, indirectly with few
>>> like PC and SP, and accessing the rest of the GPRs is different and/or
>>> has penalties.
>>
>> As far as I understand it, accessing the other registers means 32-bit
>> instructions rather than the short 16-bit instructions. So accessing
>> them has penalties compared to accessing the faster registers, but not
>> compared to normal ARM 32-bit instructions.
>
> Yes, longer code sequences, and most likely very limited instruction
> forms. The latter leads to shuffling of data between the regular 8 GPRs
> and the other, "unregular" GPRs.
That was needed for Thumb, but not for Thumb2 - you simply use the 32-bit instructions and have access to the same registers as you would with 32-bit ARM code. If you like, you can think of Thumb2 as being mostly the same as 32-bit ARM (losing a little of the barrel shifter and conditional execution capability) with the addition of 16-bit "short-cuts" for the most commonly used instructions.
>>> Just trying to say that it is a moot point. And personally, I never
>>> understood the existence of Cortex-M - why cripple the ability to switch
>>> to native 32-bit mode, if most or all of the underlying logic is there?
>>
>> My knowledge of the details is weak, but AFAIK the only thing you really
>> lose with Thumb2 compared to ARM instruction sets is the conditional
>> execution flags - with ARM, you can use the flags with most
>> instructions, while with Thumb2 you have the if-then-else construction.
>> (You also lose the barrel shifter on some instructions, but that is not
>> going to affect much code.)
>
> I am not a Thumb2 expert, either. As a very strong personal (biased)
> opinion, I don't find it elegant at all. MIPS16e impressed me a bit more
> with their EXTEND instruction.
I haven't studied either Thumb2 or MIPS16e (or PPC VLE) in detail, but they all seem to be a similar solution to a similar problem - making a variable-length encoding scheme that is easy to decode, keeps common instructions short, but makes it easy to access the full range of the cpu's abilities.
> What I am trying to communicate, is that the CPU core with all the
> blocks is there. Thumb2 is more or less a decoder, just like the ARM
> mode is. Same with MIPS32 and MIPS16e. Why would one cripple something
> by removing one of the decoders? The power savings are negligible.
>
> ARM7TDMI was more balanced in that regard.
No, the original Thumb instruction set only gave access to some of the cpu and let you write significantly slower but more compact code than full ARM. That's why they had to keep the ARM decoder too - if you needed fast code, you had to use the full instruction set. And no one considered the mix of two instruction sets to be "balance" - polite people called it a pain in the neck.

Thumb2 lets you write code that is about 60% of the size of ARM code, and is often /faster/ than 32-bit ARM code, since you can get almost all of the functionality while being more efficient on your memory bandwidth and caches.
>> With the original Thumb, ARM kept the normal 32-bit ARM ISA as well
>> because for some types of code it could be significantly faster. But
>> with Thumb2, there is almost no code for which the full 32-bit ARM
>> instructions would beat the Thumb2, taking into account the memory
>> bandwidth benefits of Thumb2.
>
> Any pointers to data showing this? Never heard of it so far, and does
> not reflect my experience.
>
> Why'd they include ARM mode at all in the Cortex-A series? :-)
For backwards compatibility. In Cortex M applications, code is generally compiled specifically for the target - so there is no need for binary compatibility. But for Cortex A systems, you regularly have pre-compiled code from many sources, and binary compatibility with older devices is essential.
On 10.1.2015 г. 20:07, David Brown wrote:
> On 09/01/15 23:04, Dimiter_Popoff wrote:
>> On 09.1.2015 г. 11:47, David Brown wrote:
>>> On 09/01/15 00:22, Dimiter_Popoff wrote:
>>>> On 09.1.2015 г. 00:53, Wouter van Ooijen wrote:
>>>>> Dimiter_Popoff wrote on 08-Jan-15 at 11:18 PM:
>>>>>> On 08.1.2015 г. 23:25, David Brown wrote:
>>>>>>> ... (Just as "cpus should have more than 16 core registers" is a
>>>>>>> good reason for disliking ARM's, if that is your opinion.)
>>>>>>
>>>>>> Results of arithmetic calculations are not exactly what I would
>>>>>> call an opinion.
>>>>>> 16 registers - one of which being reserved for the PC - are too few
>>>>>> for a load/store machine.
>>>>>
>>>>> 16 is a fact, but the rest is nothing but opinion.
>>>>>
>>>>>> Clearly it will work but under equal conditions
>>>>>> will be slower than if it had 32 registers, sometimes much slower.
>>>>>
>>>>> More registers can be slower too.
>>>>
>>>> Yes, 32 is about the optimum I suppose. But I have not really analyzed
>>>> that, what I have - and demonstrated by an example which you chose to
>>>> ignore - is the comparison 32 vs. 16.
>>>
>>> Certainly it is possible to pick examples where 32 registers is more
>>> effective than 16 - but equally we can pick examples where 16 registers
>>> is more efficient (such as context switching, or code with a lot of
>>> small functions). Examples are illustrative, but not proof of a general
>>> rule.
>>
>> You have yet to prove this point. Context switching is not a valid
>> example, as I explained in my former post which you must have read
>> prior to replying to (for those who have not, context switching is
>> responsible for a fraction of a percent of CPU time, consequently
>> halving or even completely eliminating that can bring a fraction of
>> a percent improvement, i.e. it is negligible).
>
> I think it would be wrong to talk about /proving/ points here
> ...

OK, I really meant "make" your point. Though in technical terms making a point which cannot be proven is fairly pointless...
> Context switching is not a valid example for /you/, based on the figures
> /you/ gave.
So you understand that these figures are correct - but imply that the example applies just to me. I thought we could agree at least on the meaning of numbers.
> I have written systems that had timer interrupts at 100,000
> times per second
Which has *nothing* to do with context switching. If you do that and save all registers instead of the minimum you have to, you just don't know what you are doing.

I know from threads from years past that you tend to mix up task scheduling and interrupt processing, but please understand that there is a world of difference between interrupt processing and a task switch initiated by an interrupt.

You have not made a point - context switching is *not* a case where 32 registers can be worse off than 16 in a non-negligible way (negligible meaning a performance cost within, say, 0.1%, and latency-wise the same as 16 or better). You have yet to give a valid example for what you claim.
> In general, in bigger systems (and PowerPC cores tend to be used in
> bigger systems than most of the embedded cores we see here) you try to
> avoid many interrupts, and prefer DMA and more sophisticated peripherals
> to keep the interrupt rate low.
This is wrong, it is not true that on larger systems interrupts must be avoided. You should understand that there is no such animal as "in general" in engineering. Things we make have to *work*, so we have to go down to the details.

For example, on an mpc5200b based system - the one for my everyday use for programming, emailing etc., running DPS of course - I have plenty of interrupts all the time: from the display controller vertical retrace, from the two PS2 ports where the mouse and the keyboard are connected - each PS2 clock causes an interrupt, and yes, they can come at a faster rate than every 10uS - then there are the ATA interface interrupts etc. etc.

These have *nothing* to do with task switching, neither do they initiate one. Say the decrementer interrupt might initiate one, and then of course all registers will be shuffled - but *AFTER* the interrupt has been served and unmasked again, so that the interrupt latency stays low (so a 10uS IRQ rate cannot impress the machine).

Whatever example you try to come up with, you will never find one which requires saving *all* of the registers with the interrupts masked - which means there is no performance advantage in having 16 rather than 32 registers, while the opposite is true most if not all of the time.
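For readers following along, a generic C sketch of the pattern being described - pend the reschedule in the interrupt handler, and do the expensive register shuffle only after the interrupt has been served and unmasked. This is not DPS code; all names are invented for illustration:

#include <stdbool.h>

extern void acknowledge_timer(void);    /* hypothetical: clear the IRQ source   */
extern void switch_to_next_task(void);  /* hypothetical: full context save/load */

static volatile bool reschedule_pending;

void decrementer_handler(void)          /* runs with only a few registers saved */
{
    acknowledge_timer();
    reschedule_pending = true;          /* just set a flag - cheap and fast */
}

void after_interrupt(void)              /* runs with interrupts unmasked again */
{
    if (reschedule_pending) {
        reschedule_pending = false;
        switch_to_next_task();          /* the heavy register work happens here,
                                           outside the interrupt-masked window */
    }
}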
> Other than interrupt context switches, there is also function call
> overhead.
Same concept: save/restore only what needs to be saved/restored. Fewer registers are only as good as long as you do not need more registers; once you do, you have to save/restore more *because* you have fewer registers. I explained that once already and I am doing it again for you, please do not make me do it for a third time. Just think and be willing to understand the obvious.
> I don't know what sort of calling conventions you use in your
> VPA, but for high-level languages there is normally a defined convention
> with some registers being caller-saved (or "volatile") and others being
> callee-saved (or "non-volatile"). The split can vary between compilers
> ...

This is irrelevant, we are comparing cores. Whether this or that compiler got some of its basics right or wrong has nothing to do with it. The fundamental principle - "save/restore only what you have to" - applies in all cases of programming.

IOW if you waste resources by saving/restoring more than you have to, you are doing only a little better than masking all interrupts and jumping into a "bra *" loop; the ways to destroy something working are probably infinite, these are only two of them.
>> I would certainly be interested in the example you claim to be able
>> to produce demonstrating how 16 registers can be more effective.
>> My example covers a typical, widespread application - that of a FIR.
>> Let us see yours.
>
> I can't produce a /function/ that is more efficient with 16 registers
> than 32 registers - but I hope that above I have explained how it makes
> a difference with chains of small functions (or functions with few
> register demands).
I know you cannot produce an example - the above (much of it clipped) was irrelevant.
> I think it is reasonable to say that as programs (in the embedded world)
> have got bigger, memories have got bigger, and compilers have got
> better, then the balance for many systems has moved more towards 32
> registers rather than 16 registers, in the same way that it has moved
> towards 32-bit cpus from 8 and 16-bit cpus.
You still do not get it, do you? 32 registers are not just *better*, they are a necessity on a load/store machine with a pipeline (something like 5-6 stages deep). You just cannot keep the pipeline full with only 16 registers without stalls caused by data dependencies.
> Beyond that, I would say that even if 16 registers is not /more/
> efficient than 32 registers, I have yet to see any reasoning for why 32
> registers is /significantly/ more useful for the type of code generally
> seen in microcontrollers (and yes, I have no choice here but to talk of
> generalities).
My FIR example demonstrated roughly a 3-fold improvement, and above I just explained - *again* - why. It should be easy to see how this applies not just to a FIR but to any computationally intensive algorithm where data dependencies would otherwise kick in.
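For readers who have not seen the earlier FIR example, a rough C illustration of the register-pressure argument - plain C, not the VPA code under discussion, and it assumes n is a multiple of 4:

#include <stdint.h>

/* With enough registers the loop can be unrolled so the loads are issued
   well ahead of the multiply-accumulates that consume them, hiding the
   load-to-use latency instead of stalling. */
int32_t fir_dot(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

    for (int i = 0; i < n; i += 4) {
        /* Eight independent loads - each value wants its own register so
           the multiplies below do not wait on a just-issued load. */
        int32_t x0 = x[i], x1 = x[i+1], x2 = x[i+2], x3 = x[i+3];
        int32_t h0 = h[i], h1 = h[i+1], h2 = h[i+2], h3 = h[i+3];

        /* Four independent accumulators also break the add-after-add
           dependency chain. */
        acc0 += x0 * h0;
        acc1 += x1 * h1;
        acc2 += x2 * h2;
        acc3 += x3 * h3;
    }

    return acc0 + acc1 + acc2 + acc3;
}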
> A filter algorithm might run faster with more registers
> - but it would do even better by using additional DSP-specific support
Obviously hardware other than a general purpose core can be built to do things the core cannot do. We are comparing the cores here.
> On one PPC system I worked with which did not support individual
> interrupt vectors, /all/ registers were saved (and later restored)
> because the vector code was calling unknown external code.
Poor programming. IRQ routines may never call unknown external code.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
Wouter van Ooijen <wouter@voti.nl> wrote:

(snip)
>> Simon Clubley <clubley@remove_me.eisner.decus.org-earth.ufp> wrote:
>>> On 2015-01-09, Dimiter_Popoff <dp@tgi-sci.com> wrote:
(snip)
>>> In the general case, you have to push and pop all the registers every
>>> time you take an interrupt.
(snip, then I wrote)
>> Some processors have multiple register sets that might avoid that.
>> SPARC has register windows, such that they don't have to save to
>> memory until all the windows are in use. I don't know if that is
>> for interrupts, too.
> That is a nice concept, but it has two problems:
> - when you overflow the available register sets, you must spill to
>   memory. Whether you need to do this depends on where you are in the
>   register set. This makes timing difficult to predict, which is not
>   nice for a real-time system.
Seems that they thought of that.
> - imagine a context switch: now you have to save/restore all
>   register sets!
and that.
> In general, fat-context CPUs are better at single-threaded no-interrupt
> applications, but worse for switch-often interrupt-heavy applications.
http://www.gaisler.com/doc/sparcv8.pdf

Section D.8 explains that one.

For the more usual case where function calls are much more common than context switches, and you want to minimize total overhead, more register windows are better.

For SPARC, there is a register mask that controls how many are used by user mode code. Changing that allows supervisor code to use some, such that one can optimize overall between user and supervisor code.

And finally, they consider the case where context switch time is most important. In that case, they allocate register windows between tasks, such that one can do a task switch by only changing the register mask and current window, and not storing any to memory. The OS has complete control over how the windows are used.

-- glen
On 2015-01-10, Dimiter_Popoff <dp@tgi-sci.com> wrote:
> I am still somewhat amazed at what was said.
That just shows the disconnect between what you do and what the rest of the world does. :-)
> What on earth is there to stop people from doing something
> *that* simple - here are two IRQ handlers on a small MCU, I used
> an mcf52211 a couple of months back to make a HV source - it
> does the PWM/regulating, overcurrent protection/limiting, serial
> communication etc., all in all 4 tasks, several IRQs. Took me
> about 2 weeks to program (I had hoped it would take 2 days but
> I had completely forgotten the insides of the 52211 so I had to
> recall a lot which is where the 2 weeks went). A total of about
> 250k sources, the object code being almost 9 kilobytes.
>
> So here are two IRQ handlers where hopefully it is obvious how
> only what is needed is saved and restored, pretty basic stuff:
>
> http://tgi-sci.com/misc/hvst0q.gif
>
> This is not VPA, just plain 68k (well, CF) assembly, so it is
> far from being taken from my own world.
While the actual wrapper around the device specific interrupt handler (ie: the generic IRQ code which runs before you get to the device specific handler itself) is generally still assembly language, most people don't write the actual device specific handler in assembly language any more, but use a higher level language such as C instead. Once you do that, the IRQ wrapper needs to save all the registers the C compiler could potentially use, including all the temporary registers, before the wrapper calls the device specific handler.
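A C sketch of the split being described - the assembly wrapper itself is not shown, and the table-driven layout and names are invented for illustration:

#include <stddef.h>

#define NUM_IRQ_SOURCES 32

typedef void (*irq_handler_t)(void *arg);

static irq_handler_t handler_table[NUM_IRQ_SOURCES];
static void         *handler_arg[NUM_IRQ_SOURCES];

/* Called from the assembly wrapper after the caller-saved registers have
   been stacked; 'source' is whatever the interrupt controller reports. */
void generic_irq_dispatch(unsigned source)
{
    if (source < NUM_IRQ_SOURCES && handler_table[source] != NULL)
        handler_table[source](handler_arg[source]);
}

/* Device drivers written in portable C hook themselves in at init time. */
void irq_register(unsigned source, irq_handler_t handler, void *arg)
{
    if (source < NUM_IRQ_SOURCES) {
        handler_table[source] = handler;
        handler_arg[source]   = arg;
    }
}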
> I just wonder how hopeless things must have become to question
> the viability of doing something that basic.
Different set of tradeoffs.

The higher level language code can potentially be reused on multiple architectures (or, if that's not possible in a specific case, can at least be used as the starting point for another driver); your PowerPC specific assembly language code cannot be reused in such a way.

What works for you in your restricted environment doesn't work when you need your code to work in a generic environment across a wide range of architectures.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
On 2015-01-09, Anders.Montonen@kapsi.spam.stop.fi.invalid <Anders.Montonen@kapsi.spam.stop.fi.invalid> wrote:
> microMIPS has instructions for pushing and popping the callee-save
> registers onto the stack (LWM32/LWM16/SWM32/SWM16). This is notable in a
> way because MIPS have traditionally avoided committing an ABI into the
> architecture.
I didn't know that. Thanks.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
