Integrated TFT controller in PIC MCUs

Reply by John Devereux ● January 9, 2015

Vladimir Ivanov <none@none.tld> writes:

> On Fri, 9 Jan 2015, David Brown wrote:
>
>> On 09/01/15 10:54, Vladimir Ivanov wrote:
>>>
>>> On Fri, 9 Jan 2015, David Brown wrote:
>>>
>>>> For microcontrollers, such as the Cortex M devices, I think 16 registers
>>>> is a good balance for a lot of typical code.
>>>
>>> In Thumb2 you work directly with 8 GP registers, indirectly with few
>>> like PC and SP, and accessing the rest of the GPRs is different and/or
>>> has penalties.
>>
>> As far as I understand it, accessing the other registers means 32-bit
>> instructions rather than the short 16-bit instructions.  So accessing
>> them has penalties compared to accessing the faster registers, but not
>> compared to normal ARM 32-bit instructions.
>
> Yes, longer code sequences, and most likely very limited instruction
> forms. The latter leads to shuffling of data between the regular 8
> GPRs and the other, "unregular" GPRs.
>
>>> Just trying to say that it is a moot point. And personally, I never
>>> understood the existence of Cortex-M - why cripple the ability to switch
>>> to native 32-bit mode, if most or all of the underlying logic is there?
>>
>> My knowledge of the details is weak, but AFAIK the only thing you really
>> lose with Thumb2 compared to ARM instruction sets is the conditional
>> execution flags - with ARM, you can use the flags with most
>> instructions, while with Thumb2 you have the if-then-else construction.
>> (You also lose the barrel shifter on some instructions, but that is not
>> going to affect much code.)
>
> I am not a Thumb2 expert, either. As a very strong personal (biased)
> opinion, I don't find it elegant at all. MIPS16e impressed me a bit more
> with their EXTEND instruction.
>
> What I am trying to communicate, is that the CPU core with all the
> blocks is there. Thumb2 is more or less a decoder, just like the ARM
> mode is. Same with MIPS32 and MIPS16e. Why would one cripple something
> by removing one of the decoders? The power savings are negligible.
>
> ARM7TDMI was more balanced in that regard.

I have used both; Cortex-M3/M4 is just much nicer to program. The code
is compact, and faster clock-for-clock than even 32-bit ARM7 code. No
more convoluted assembly language wrappers everywhere, no "thumb
interworking", "GLUE7" segments, no half a dozen system modes+stacks to
worry about.

>
>> With the original Thumb, ARM kept the normal 32-bit ARM ISA as well
>> because for some types of code it could be significantly faster.  But
>> with Thumb2, there is almost no code for which the full 32-bit ARM
>> instructions would beat the Thumb2, taking into account the memory
>> bandwidth benefits of Thumb2.
>
> Any pointers to data showing this? Never heard of it so far, and does
> not reflect my experience.
>
> Why'd they include ARM mode at all in the Cortex-A series? :-)

-- 

John Devereux

Reply by ● January 9, 2015

Vladimir Ivanov <none@none.tld> wrote:

> What I am trying to communicate, is that the CPU core with all the blocks 
> is there. Thumb2 is more or less a decoder, just like the ARM mode is. 
> Same with MIPS32 and MIPS16e. Why would one cripple something by removing 
> one of the decoders? The power savings are negligible.

The tagline for Thumb-2 is the performance of ARM with the code size of 
Thumb. It would be interesting to see a comprehensive benchmark 
comparing the two. The best I've found so far is an ARM presentation 
with some numbers for the EEMBC benchmarks[1], which shows Thumb-2 
having 98% of the performance.

-a

[1] <http://elinux.org/images/8/8a/Experiment_with_Linux_and_ARM_Thumb-2_ISA.pdf>

Reply by Tauno Voipio ● January 9, 2015

On 9.1.15 17:30, Vladimir Ivanov wrote:

>> As far as I understand it, accessing the other registers means 32-bit
>> instructions rather than the short 16-bit instructions.  So accessing
>> them has penalties compared to accessing the faster registers, but not
>> compared to normal ARM 32-bit instructions.
>
> Yes, longer code sequences, and most likely very limited instruction
> forms. The latter leads to shuffling of data between the regular 8 GPRs
> and the other, "unregular" GPRs.
>

This applies partially to the old Thumb. Thumb2 is still shorter than
32-bit ARM code for the same task. Using r8-r15 costs two extra bytes in
most instructions, but even these expensive forms are only as long as
regular 32-bit ARM instructions.
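
As a rough illustration of the size argument above, here is a minimal Python sketch. It assumes a deliberately simplified model that is not from the thread: every instruction touching only r0-r7 gets a 16-bit encoding, every instruction touching r8-r15 gets a 32-bit encoding, and the equivalent ARM code uses the same number of 4-byte instructions. The function name and the 25% figure are hypothetical.

```python
from typing import Tuple


def code_size_bytes(n_instructions: int, high_reg_fraction: float) -> Tuple[int, int]:
    """Return (thumb2_bytes, arm_bytes) under the simplified model.

    high_reg_fraction is the share of instructions that need r8-r15 and
    therefore pay two extra bytes for a 32-bit Thumb2 encoding.
    """
    if not 0.0 <= high_reg_fraction <= 1.0:
        raise ValueError("high_reg_fraction must be between 0 and 1")
    wide = round(n_instructions * high_reg_fraction)
    thumb2_bytes = 2 * n_instructions + 2 * wide  # 2 bytes each, plus 2 for wide forms
    arm_bytes = 4 * n_instructions                # fixed 32-bit encoding
    return thumb2_bytes, arm_bytes


# Hypothetical mix: 1000 instructions, a quarter of them using r8-r15.
thumb2, arm = code_size_bytes(1000, 0.25)
print(f"Thumb2: {thumb2} bytes, ARM: {arm} bytes")
```

In this model Thumb2 only reaches the ARM size when every single instruction needs a high register; real code differs because some operations have no 16-bit form at all.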

-- 

-Tauno Voipio

Reply by ● January 9, 2015

Vladimir Ivanov <none@none.tld> wrote:
> On Fri, 9 Jan 2015, Simon Clubley wrote:

>> Do current versions of the MIPS ISA allow you to push a set of registers
>> onto the stack in one instruction as you can with ARM or do you still have
>> to push (and pop) them one after the other manually in your handlers ?
> No, because it is a load/store architecture. I don't really have any 
> experience with the new MicroMIPS, but not expecting that to change.

microMIPS has instructions for pushing and popping the callee-save 
registers onto the stack (LWM32/LWM16/SWM32/SWM16). This is notable in a 
way because MIPS have traditionally avoided committing an ABI into the 
architecture.

> MIPS16e, as present in the PIC32MX (MIPS M4K core), is comparable to 
> Thumb2. 
> MicroMIPS, as present in the newer PIC32MZ (MIPS 14K core), is even better 
> than MIPS16e.
> 
> One benefit is that you can always switch to MIPS32 mode for performance 
> reasons, unlike the pure Thumb2 MCUs, like Cortex-M.

MIPS16e is much closer to Thumb. You only have a subset of the 
registers available, and no system control instructions. microMIPS is 
comparable to Thumb-2, and the idea is the same. Shrink the code size 
while retaining performance.
MIPS32 support is optional for cores that support microMIPS. In fact, 
the latest version of Microchip's XC32 compiler includes support for an 
unreleased PIC32MM family which only supports microMIPS.

-a

Reply by Dimiter_Popoff ● January 9, 2015

On 09.1.2015 г. 11:47, David Brown wrote:
> On 09/01/15 00:22, Dimiter_Popoff wrote:
>> On 09.1.2015 г. 00:53, Wouter van Ooijen wrote:
>>> Dimiter_Popoff schreef op 08-Jan-15 om 11:18 PM:
>>>> On 08.1.2015 г. 23:25, David Brown wrote:
>>>>> ...  (Just as "cpus should
>>>>> have more than 16 core registers" is a good reason for disliking ARM's,
>>>>> if that is your opinion.)
>>>>
>>>> Results of arithmetic calculations are not exactly what I would call an
>>>> opinion.
>>>> 16 registers - one of which being reserved for the PC - are too few for
>>>> a load/store machine.
>>>
>>> 16 is a fact, but the rest is nothing but opinion.
>>>
>>>   > Clearly it will work but under equal conditions
>>>> will be slower than if it had 32 registers, sometimes much slower.
>>>
>>> More registers can be slower too.
>>
>> Yes, 32 is about the optimum I suppose. But I have not really analyzed
>> that, what I have - and demonstrated by an example which you chose to
>> ignore - is the comparison 32 vs. 16.
>
> Certainly it is possible to pick examples where 32 registers is more
> effective than 16 - but equally we can pick examples where 16 registers
> is more efficient (such as context switching, or code with a lot of
> small functions).  Examples are illustrative, but not proof of a general
> rule.

You have yet to prove this point. Context switching is not a valid
example, as I explained in my former post which you must have read
prior to replying to (for those who have not, context switching is
responsible for a fraction of a percent of CPU time, consequently
halving or even completely eliminating that can bring a fraction of
a percent improvement, i.e. it is negligible).
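
The reasoning here is Amdahl's law. A minimal Python sketch, assuming a hypothetical figure of 0.5% of CPU time spent on context switching (the post only says "a fraction of a percent"); the function name is illustrative:

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only `fraction` of the
    run time is accelerated by a factor of `local_speedup`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Assumed share of CPU time spent context switching (hypothetical value).
CONTEXT_SWITCH_FRACTION = 0.005

halved = overall_speedup(CONTEXT_SWITCH_FRACTION, 2.0)
eliminated = overall_speedup(CONTEXT_SWITCH_FRACTION, float("inf"))
print(f"halving the cost:     {100 * (halved - 1):.2f}% faster overall")
print(f"eliminating the cost: {100 * (eliminated - 1):.2f}% faster overall")
```

With that assumed share, halving the switch cost gains about a quarter of a percent overall and removing it entirely gains about half a percent.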

I would certainly be interested in the example you claim to be able
to produce demonstrating how 16 registers can be more effective.
My example covers a typical, widespread application - that of a FIR.
Let us see yours.

>>> More registers => more bits in an instruction to specify a register =>
>>> more to read from code memory => slower
>>
>> Not really. Load/store machines typically  have a fixed 32 bit
>> instruction word; using that for only 16 registers is simply
>> waste of space, you will have less information packed in the
>> opcode thus you will need more opcodes to do the same job, thus
>> having to do *more* memory fetches.
>
> Many load/store machines have some sort of compressed or limited
> instruction format for greater efficiency (especially of instruction
> cache).  ARM has a couple of "thumb" modes, MIPs have an equivalent, and
> on the PPC I have seen various schemes.

I'd be interested to see those sub-32 bit opcode schemes you talk about
on power, I have yet to encounter one of them. Certainly none of them
is present on the cores I use or have investigated. (I am just curious,
not in need of something like that).

> Even in the full 32-bit ARM set, the bits "saved" by having fewer
> registers are used for the conditional execution bits and the barrel
> shifter.

Can you please elaborate on that. What can ARM do using the barrel
shifter which you cannot do using the likes of rlwinm, rlwnm or rlwimi
on power?
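
For readers who have not met the Power rotate-and-mask instructions, here is a rough Python model of rlwinm (Rotate Left Word Immediate then AND with Mask). It is only a sketch of the instruction's documented behaviour, not of any particular core; the helper name `ibm_mask` is made up for illustration. Bit 0 is the most significant bit, as in Power's bit numbering.

```python
MASK32 = 0xFFFFFFFF


def ibm_mask(mb: int, me: int) -> int:
    """32-bit mask with ones from bit mb through bit me, where bit 0 is
    the most significant bit (Power's big-endian bit numbering)."""
    head = MASK32 >> mb                     # ones in bits mb..31
    tail = (MASK32 << (31 - me)) & MASK32   # ones in bits 0..me
    # When mb > me the mask wraps around the end of the word.
    return head & tail if mb <= me else head | tail


def rlwinm(rs: int, sh: int, mb: int, me: int) -> int:
    """Rotate rs left by sh bits, then AND with the mask mb..me."""
    if not (0 <= sh <= 31 and 0 <= mb <= 31 and 0 <= me <= 31):
        raise ValueError("sh, mb and me must be in the range 0..31")
    rs &= MASK32
    rotated = ((rs << sh) | (rs >> (32 - sh))) & MASK32
    return rotated & ibm_mask(mb, me)


# Logical shift right by 24 (simplified mnemonic: srwi rA,rS,24).
print(hex(rlwinm(0x12345678, 8, 24, 31)))
# Logical shift left by 4 (simplified mnemonic: slwi rA,rS,4).
print(hex(rlwinm(0x00000001, 4, 0, 27)))
```

One instruction therefore covers shifts, rotates and bit-field extraction, which is the point of the question above.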

> Every bit of space in the instruction set is important.  Using them to
> support extra registers may be the best overall tradeoff in some cases,
> but it is always a tradeoff.

This kind of general talk leads nowhere here as you may have found out
in previous discussions. If you cannot support a claim by a valid
particular example you are saying nothing.

I explained that 16 registers are too few for a load/store machine,
gave an example to make it easier to understand why (to be able to
compensate for the pipeline delay).

You are just repeating your opinion basing it on nothing.

>> If you become familiar with the power architecture instruction set
>> you will find very little room for performance improvement, the
>> person who did it knew what he was doing really well.
>>
>
> It's a nice architecture (apart from the backwards bit numbering!).

The bit numbering is simply big endian. I also
have had my trouble with it of course but it is easy to get used to.
In VPA I do use both - crazy as it may seem, once you get used to
thinking about it, it is no longer an issue.
But overall having big endian bit numbering on a big endian machine
is the correct thing to do. If one chooses to use a power core in
little endian mode much of the time one will be somewhat screwed
I suppose.

>... But
> it is a very complex architecture - for smaller devices (say 200 MHz,
> single core - microcontroller class cpus) a PPC core will be much
> bigger, more difficult to design and work with, and take more power than
> an ARM (or MIPS) core.

Oh I agree power does not make sense on the smallest of MCUs, of course.
In fact I don't think it makes much sense below a megabyte of RAM
or so (but then that's me and is based on my needs so far, I am not
claiming this to be some general rule).

>> Not at all. You can always save/restore even just 1 register if you
>> want to on a 32 register machine, what you *cannot* do on ARM is
>> have more than 13 registers to use in an IRQ handler when you need
>> them - so you will need *more* memory accesses in this case.
>
> That makes no sense to me.

Well I really cannot simplify the concept of saving say 4 out of 32
registers, using only them in an IRQ handler, then restoring only
them and returning from the exception.

>> And since the latency which matters is the IRQ latency (tasks get
>> switched once in milliseconds, IRQ latencies may well have to be
>> in the few µs range) - the time to save/restore all registers
>> is negligible (e.g. 32 2.5 ns cycles once per ms on a 400 MHz power
>> core, or 0.008% of the time).
>>
>
> The time needed to save and restore all registers may not be relevant in
> a given application, but it is not irrelevant or negligible in all cases.
>

So show us one such case.
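
The 0.008% figure quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name is illustrative, and the 32 cycles per millisecond are the numbers from the quoted post.

```python
def save_restore_overhead(cycles: int, clock_hz: float, period_s: float) -> float:
    """Fraction of CPU time used when `cycles` clock cycles are spent
    once every `period_s` seconds on a core running at `clock_hz`."""
    cycle_time_s = 1.0 / clock_hz
    return cycles * cycle_time_s / period_s


# 32 cycles of 2.5 ns each (400 MHz), once per millisecond.
overhead = save_restore_overhead(cycles=32, clock_hz=400e6, period_s=1e-3)
print(f"{overhead * 100:.3f}% of CPU time")
```

That is 80 ns out of every millisecond.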

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/

Reply by ● January 9, 2015

Dimiter_Popoff <dp@tgi-sci.com> wrote:
> I'd be interested to see those sub-32 bit opcode schemes you talk about
> on power, I have yet to encounter one of them. Certainly none of them
> is present on the cores I use or have investigated. (I am just curious,
> not in need of something like that).

IBM has a scheme called CodePack, where compressed code is unpacked into 
L1 cache on misses. It's an interesting (and very different) approach, 
but it loses out on the improved cache utilization of schemes like 
Thumb-2 and microMIPS.

<http://web.eecs.umich.edu/~tnm/papers/micro99.pdf>
<http://researcher.watson.ibm.com/researcher/files/us-lefurgy/micro32-slides.pdf>

-a

Reply by Dimiter_Popoff ● January 9, 2015

On 10.1.2015 г. 01:31, Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:
> Dimiter_Popoff <dp@tgi-sci.com> wrote:
>> I'd be interested to see those sub-32 bit opcode schemes you talk about
>> on power, I have yet to encounter one of them. Certainly none of them
>> is present on the cores I use or have investigated. (I am just curious,
>> not in need of something like that).
>
> IBM has a scheme called CodePack, where compressed code is unpacked into
> L1 cache on misses. It's an interesting (and very different) approach,
> but it loses out on the improved cache utilization of schemes like
> Thumb-2 and microMIPS.
>
> <http://web.eecs.umich.edu/~tnm/papers/micro99.pdf>
> <http://researcher.watson.ibm.com/researcher/files/us-lefurgy/micro32-slides.pdf>
>
> -a
>

Thanks, I had never seen that.
Looks very different indeed - and interesting but probably impractical.

The comparisons are practically comparisons between the efficiency
of the respective compilers/compiler libraries rather than of the
machines.

The fact that a 16k instruction cache can be too small for some code
demonstrates only how poor the code is - which is the norm nowadays,
of course (and the reason why they try to defeat the messy programming
by building hardware on top of it, which can only work short term but
here we are).

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/

Reply by Simon Clubley ● January 9, 2015

On 2015-01-09, Dimiter_Popoff <dp@tgi-sci.com> wrote:
>
> Well I really cannot simplify the concept of saving say 4 out of 32
> registers, using only them in an IRQ handler, then restoring only
> them and returning from the exception.
>

In the general case, you have to push and pop all the registers every
time you take an interrupt.

In your world you may not have to always do that, but for the general
purpose case when you don't have absolute control of the code being
called from the handler you do.

Even when you control the code being called from the handler, you still
have to push all the registers the code could potentially use if it's
written in a high level language.

Or to put this another way, your usage model when it comes to interrupt
handlers is not the general usage model that most other people have to
work with. :-)

Simon.

-- 
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world

Reply by Dimiter_Popoff ● January 9, 2015

On 10.1.2015 г. 03:36, Simon Clubley wrote:
> On 2015-01-09, Dimiter_Popoff <dp@tgi-sci.com> wrote:
>>
>> Well I really cannot simplify the concept of saving say 4 out of 32
>> registers, using only them in an IRQ handler, then restoring only
>> them and returning from the exception.
>>
>
> In the general case, you have to push and pop all the registers every
> time you take an interrupt.

!?

In the general case you do not, if you are the programmer.

> In your world you may not have to always do that, but for the general
> purpose case when you don't have absolute control of the code being
> called from the handler you do.
> Even when you control the code being called from the handler, you still
> have to push all the registers the code could potentially use if it's
> written in a high level language.

Of course you can program any machine to a complete halt. Or just use
a hammer to smash it, this will perhaps be an easier way.

> Or to put this another way, your usage model when it comes to interrupt
> handlers is not the general usage model that most other people have to
> work with. :-)

Well if programming has deteriorated to *such* a degree I really
do not have many people to converse with about programming, this much
is obvious :-).

But this does not change the validity of the concept "save/restore only
what you have to" when applied in the core to core comparison context.

My God, I really did not think things had gone *that* bad.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/

Reply by glen herrmannsfeldt ● January 9, 2015

Simon Clubley <clubley@remove_me.eisner.decus.org-earth.ufp> wrote:
> On 2015-01-09, Dimiter_Popoff <dp@tgi-sci.com> wrote:

>> Well I really cannot simplify the concept of saving say 4 out of 32
>> registers, using only them in an IRQ handler, then restoring only
>> them and returning from the exception.
 
> In the general case, you have to push and pop all the registers every
> time you take an interrupt.

Some processors have multiple register sets that might avoid that.
SPARC has register windows, such that they don't have to save to
memory until all the windows are in use.  I don't know if that is
for interrupts, too.
 
> In your world you may not have to always do that, but for the general
> purpose case when you don't have absolute control of the code being
> called from the handler you do.

-- glen