Reply by Tauno Voipio January 18, 2015
On 18.1.15 19:19, David Brown wrote:
>>> That was needed for Thumb, but not for Thumb2 - you simply use the 32-bit instructions and have access to the same registers as you would with 32-bit ARM codes. If you like, you can think of Thumb2 as being mostly the same as 32-bit ARM (losing a little barrel shifter and conditional execution capability) with the addition of 16-bit "short-cuts" for the most commonly used instructions.
>>
>> Did they make everything orthogonal and only a matter of instruction size? I have to recheck this, have forgotten most of it already.
>
> No, it is not entirely orthogonal. In particular, common combinations will have 16-bit Thumb2 instructions, while less common combinations will have 32-bit Thumb2 instructions. For example, in the ARM, like in most RISC architectures, there is no dedicated stack register - you simply use one of the general registers along with appropriate pre- and post-increment/decrement addressing modes. But by convention, and codified in the ABI, one of the registers (r13 on the ARM, IIRC) is always used as the stack pointer. Instructions using r13 for these sorts of addressing modes will be common in the 16-bit Thumb2 encodings, but use of the same modes with other registers probably needs 32-bit Thumb2 encodings. (I haven't confirmed the details of this with the ARM documentation, but the principle is accurate.) The same thing applies to similar shortened encodings on other processors.
There is more to this: In Cortex M3 and M4, the hardware uses r13 as the stack pointer to save the processor state in exception handling. The register is banked, which lets you use separate stacks for thread code and exception (and system) code.

--
-TV
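A minimal sketch of that banking in use, assuming a CMSIS-based Cortex-M3/M4 toolchain (the stack size and names here are arbitrary placeholders): thread code is moved onto the process stack (PSP), while handlers keep using the main stack (MSP).

#include <stdint.h>
/* Assumes a CMSIS environment: your part's device header (a placeholder
   name below) pulls in core_cm3.h/core_cm4.h, which provide __set_PSP,
   __get_CONTROL, __set_CONTROL and __ISB. */
#include "device.h"

static uint32_t process_stack[256];   /* arbitrary size for this sketch */

void use_separate_thread_stack(void)
{
    /* Point PSP at the top of the dedicated thread stack (full-descending). */
    __set_PSP((uint32_t)&process_stack[256]);

    /* CONTROL bit 1 (SPSEL): thread mode now uses PSP as its r13;
       handler mode always uses MSP, so exceptions get their own stack. */
    __set_CONTROL(__get_CONTROL() | (1u << 1));
    __ISB();   /* make sure the stack switch takes effect before continuing */
}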
Reply by David Brown January 18, 2015
On 16/01/15 20:30, Vladimir Ivanov wrote:
>
> On Sat, 10 Jan 2015, David Brown wrote:
>
>> On 09/01/15 16:30, Vladimir Ivanov wrote:
>>>
>>> On Fri, 9 Jan 2015, David Brown wrote:
>>>
>>>> On 09/01/15 10:54, Vladimir Ivanov wrote:
>>>>>
>>>>> On Fri, 9 Jan 2015, David Brown wrote:
>>>>>
>>>>>> For microcontrollers, such as the Cortex M devices, I think 16 registers is a good balance for a lot of typical code.
>>>>>
>>>>> In Thumb2 you work directly with 8 GP registers, indirectly with few like PC and SP, and accessing the rest of the GPRs is different and/or has penalties.
>>>>
>>>> As far as I understand it, accessing the other registers means 32-bit instructions rather than the short 16-bit instructions. So accessing them has penalties compared to accessing the faster registers, but not compared to normal ARM 32-bit instructions.
>>>
>>> Yes, longer code sequences, and most likely very limited instruction forms. The latter leads to shuffling of data between the regular 8 GPRs and the other, "unregular" GPRs.
>>>
>>
>> That was needed for Thumb, but not for Thumb2 - you simply use the 32-bit instructions and have access to the same registers as you would with 32-bit ARM codes. If you like, you can think of Thumb2 as being mostly the same as 32-bit ARM (losing a little barrel shifter and conditional execution capability) with the addition of 16-bit "short-cuts" for the most commonly used instructions.
>
> Did they make everything orthogonal and only a matter of instruction size? I have to recheck this, have forgotten most of it already.
No, it is not entirely orthogonal. In particular, common combinations will have 16-bit Thumb2 instructions, while less common combinations will have 32-bit Thumb2 instructions. For example, in the ARM, like in most RISC architectures, there is no dedicated stack register - you simply use one of the general registers along with appropriate pre- and post-increment/decrement addressing modes. But by convention, and codified in the ABI, one of the registers (r13 on the ARM, IIRC) is always used as the stack pointer. Instructions using r13 for these sorts of addressing modes will be common in the 16-bit Thumb2 encodings, but use of the same modes with other registers probably needs 32-bit Thumb2 encodings. (I haven't confirmed the details of this with the ARM documentation, but the principle is accurate.) The same thing applies to similar shortened encodings on other processors.
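To make that concrete - a hedged sketch, with the file and function names invented for the example - compiling a small function for Thumb2 and disassembling it shows the encoding split directly:

/* demo.c - build and inspect with, e.g.:
 *   arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -Os -c demo.c
 *   arm-none-eabi-objdump -d demo.o
 * Loads, adds and compares on the low registers (r0-r7) typically come
 * out as 16-bit Thumb2 encodings; operations that need high registers
 * (r8-r12) or less common addressing forms get 32-bit encodings. */
int sum(const int *p, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += p[i];
    return total;
}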
>
>>> What I am trying to communicate, is that the CPU core with all the blocks is there. Thumb2 is more or less a decoder, just like the ARM mode is. Same with MIPS32 and MIPS16e. Why would one cripple something by removing one of the decoders? The power savings are negligible.
>>>
>>> ARM7TDMI was more balanced in that regard.
>>
>> No, the original Thumb instruction set only gave access to some of the cpu and let you write significantly slower but more compact code than full ARM. That's why they had to keep the ARM decoder too - if you needed fast code, you had to use the full instruction set. And no one considered the mix of two instruction sets to be "balance" - polite people called it a pain in the neck.
>
> :-)
>
>> Thumb2 lets you write code that is about 60% of the size of ARM code, and is often /faster/ than 32-bit ARM code, since you can get almost all of the functionality while being more efficient on your memory bandwidth and caches.
>
> Is this still valid for the big OoO/superscalar cores?
Yes. The actual balance between size ratios and speed ratios varies a little depending on the type of code being run, but I gather that ARM and Thumb2 encodings are typically within a few percent of the same speed, while Thumb2 code size is between 60% and 80% of the ARM code size. For bigger cpus, the processing speed exceeds the memory speed by a greater amount, so they are likely to gain more overall speed from Thumb2 than smaller cpus do.
>
>>>> With the original Thumb, ARM kept the normal 32-bit ARM ISA as well because for some types of code it could be significantly faster. But with Thumb2, there is almost no code for which the full 32-bit ARM instructions would beat the Thumb2, taking into account the memory bandwidth benefits of Thumb2.
>>>
>>> Any pointers to data showing this? Never heard of it so far, and does not reflect my experience.
>>>
>>> Why'd they include ARM mode at all in the Cortex-A series? :-)
>>
>> For backwards compatibility. In Cortex M applications, code is generally compiled specifically for the target - so there is no need for binary compatibility. But for Cortex A systems, you regularly have pre-compiled code from many sources, and binary compatibility with older devices is essential.
>
> That's an interesting angle. Thanks for the comments, I will investigate some more.
There is also the case that there may be types of code that will be noticeably faster in ARM encoding than Thumb2. On a Cortex-A cpu, including the ARM decoder costs a tiny percentage of die size, and it can be powered down when not in use - it is therefore cheap to add if people want to use it. For smaller devices like Cortex-M microcontrollers, the size of an ARM decoder (in addition to Thumb2) would be a much bigger percentage of the die, and therefore a bigger fraction of the cost.
Reply by Vladimir Ivanov January 16, 2015
On Sat, 10 Jan 2015, David Brown wrote:

> On 09/01/15 16:30, Vladimir Ivanov wrote:
>>
>> On Fri, 9 Jan 2015, David Brown wrote:
>>
>>> On 09/01/15 10:54, Vladimir Ivanov wrote:
>>>>
>>>> On Fri, 9 Jan 2015, David Brown wrote:
>>>>
>>>>> For microcontrollers, such as the Cortex M devices, I think 16 registers is a good balance for a lot of typical code.
>>>>
>>>> In Thumb2 you work directly with 8 GP registers, indirectly with few like PC and SP, and accessing the rest of the GPRs is different and/or has penalties.
>>>
>>> As far as I understand it, accessing the other registers means 32-bit instructions rather than the short 16-bit instructions. So accessing them has penalties compared to accessing the faster registers, but not compared to normal ARM 32-bit instructions.
>>
>> Yes, longer code sequences, and most likely very limited instruction forms. The latter leads to shuffling of data between the regular 8 GPRs and the other, "unregular" GPRs.
>>
>
> That was needed for Thumb, but not for Thumb2 - you simply use the 32-bit instructions and have access to the same registers as you would with 32-bit ARM codes. If you like, you can think of Thumb2 as being mostly the same as 32-bit ARM (losing a little barrel shifter and conditional execution capability) with the addition of 16-bit "short-cuts" for the most commonly used instructions.
Did they make everything orthogonal and only a matter of instruction size? I have to recheck this, have forgotten most of it already.
>> What I am trying to communicate, is that the CPU core with all the blocks is there. Thumb2 is more or less a decoder, just like the ARM mode is. Same with MIPS32 and MIPS16e. Why would one cripple something by removing one of the decoders? The power savings are negligible.
>>
>> ARM7TDMI was more balanced in that regard.
>
> No, the original Thumb instruction set only gave access to some of the cpu and let you write significantly slower but more compact code than full ARM. That's why they had to keep the ARM decoder too - if you needed fast code, you had to use the full instruction set. And no one considered the mix of two instruction sets to be "balance" - polite people called it a pain in the neck.
:-)
> Thumb2 lets you write code that is about 60% of the size of ARM code, and is often /faster/ than 32-bit ARM code, since you can get almost all of the functionality while being more efficient on your memory bandwidth and caches.
Is this still valid for the big OoO/superscalar cores?
>>> With the original Thumb, ARM kept the normal 32-bit ARM ISA as well because for some types of code it could be significantly faster. But with Thumb2, there is almost no code for which the full 32-bit ARM instructions would beat the Thumb2, taking into account the memory bandwidth benefits of Thumb2.
>>
>> Any pointers to data showing this? Never heard of it so far, and does not reflect my experience.
>>
>> Why'd they include ARM mode at all in the Cortex-A series? :-)
>
> For backwards compatibility. In Cortex M applications, code is generally compiled specifically for the target - so there is no need for binary compatibility. But for Cortex A systems, you regularly have pre-compiled code from many sources, and binary compatibility with older devices is essential.
That's an interesting angle. Thanks for the comments, I will investigate some more.
Reply by Vladimir Ivanov January 16, 2015
On Sat, 10 Jan 2015, Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

> Vladimir Ivanov <none@none.tld> wrote:
>> On Fri, 9 Jan 2015, Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:
>
>>> MIPS32 support is optional for cores that support microMIPS. In fact, the latest version of Microchip's XC32 compiler includes support for an unreleased PIC32MM family which only supports microMIPS.
>> Now that you mention it, I remember seeing pointers about future PIC32MM stuck to microMIPS only. Again marketing pressure?
>
> As far as I can tell from the header files and compiler source code, the PIC32MM could be a replacement/follow-up for the PIC32MX1xx/2xx. There's no DSP ASE, and no shadow registers, so it's clearly not a high-performance chip, and it doesn't seem like it has any special peripherals either. Using microMIPS at the low end makes sense, as you can fit more code in a smaller flash. I don't know how much silicon area is saved by having only the one instruction set, but that kind of makes sense for a low-end chip as well.
They will shave some Flash space from the MIPS16e -> microMIPS transition, but that won't be revolutionary. The MIPS32 -> microMIPS saving will be noticeable, yes. Maybe MIPS16e is not that popular after all.

The silicon savings of a microMIPS-only core are probably close to none; I think this is mostly for the user's comfort of staying in a single mode, and about having a clear Thumb-2 competitor. I wouldn't be surprised if the MIPS32 decoder is present in the macro cell, just fused/disabled. That is only speculation, of course, but keeping fewer cores is a sane choice. Still, the PIC32MM might be interesting.
Reply by David Brown January 14, 2015
On 14/01/15 15:41, Dimiter_Popoff wrote:
> On 14.1.2015 г. 16:05, David Brown wrote:
>> On 14/01/15 14:14, Dimiter_Popoff wrote:
>>> On 14.1.2015 г. 14:54, David Brown wrote:
>>>> On 14/01/15 13:14, Dimiter_Popoff wrote:
>>>>> On 14.1.2015 г. 13:42, Tom Gardner wrote:
>>>>>> On 14/01/15 02:11, Dimiter_Popoff wrote:
>>>>>>> So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces. Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".
>>>>>>
>>>>>> Having L1/L2/L3 caches will instantly introduce a high variation between the mean and max latencies. Even for i486s with their minimal cache and no operating system, a 10:1 variability was visible.
>>>>>
>>>>> Yes, though on some processors one has the ability to lock part of the L1 cache - which allows to have it dedicated to interrupts which can make things a lot tighter (by saving the necessity to update entire cachelines).
>>>>>
>>>>> Overall the latency variability obviously increases as processor sizes increase but then total execution times decrease, memories get faster etc. so the worst case latency can still be very low. On the 5200b which I use I have never needed to resort to any cache locks etc., all I do is just stay masked only as absolutely necessary.
>>>>>
>>>>>> Any variability to do with register saving will be completely insignificant compared to the effects of caches. Unless, of course, you are having to dump the entire hidden state of an Itanic processor :)
>>>>>>
>>>>>
>>>>> Well we have not come to that obvious point yet I am afraid :-). Let us first have the figure on the worst-case linux IRQ latency I asked for then put into its context the try of ARM/linux devotees about lower latency by not having enough registers :-).
>>>>>
>>>>> Dimiter
>>>>>
>>>>
>>>> Neither you nor anyone else can give worst-case IRQ latencies for Linux running on PPC, MIPS, ARM, x86 or anything else - there is too much variation.
>>>
>>> This answer means it is infinite - nice figure in the context of saving a few registers, no doubt about that. Am I supposed to laugh or to cry.
>>
>> You are supposed to use something other than standard Linux (or Windows) when you need hard real time. If you really need to use Linux and you also really need real time, then you can use one of several real-time extensions to Linux which will give you a high (compared to dedicated RTOS's and more suitable hardware) but definitely not infinite maximum latency.
>>
>> Of course, since you sell a real-time system which /does/ have guaranteed worst-case latencies, obviously you should be laughing :-)
>>
>>>
>>> I can give a figure for DPS - and guarantee it, commercially. As an OS DPS is meanwhile no smaller than linux - just the applications written for it are much much fewer. VM, windows, filesystem, networking etc., it is all in there.
>>
>> I am sure DPS has lots of useful and important features - including everything you and your customers need. But I am also sure it /is/ smaller than Linux (which is currently at about 17e6 lines for the kernel alone) - the comparison is not useful.
>
> Oh but it is - if we compare the OS itself, not the applications. Meaning what you as a programmer will have as functionality via system calls. 17e6 lines of wasteful programming could well be less than my 1.7e6 lines (not sure about the exact figure), hard to say. Does their kernel include support for windows, offscreen buffers, graphics draw calls etc.?
>
>> Comparing to vxworks, QNX, RTEMS, etc., would make more sense.
>
> Do these come with all the features like windows, VM, filesystem, networking?
I am not sure that this is the best place to give a beginners' course on Linux, QNX, RTEMS, or operating systems in general. Obviously you have vast experience with DPS - yet your questions show a lack of knowledge of how these sorts of OS's are built up and structured. I can't tell if you really know so little about what Linux is, and what an OS kernel is, or if you are being intentionally naïve - I have no wish to sound patronising and write about things you have worked with every day for twenty years, but equally I am happy to explain things if that is helpful to you. Can I just suggest you read the Wikipedia articles plus each project's home page, and if we need to go further then we'll take it from there?
>
>>> And I do have a figure for the latency. So this figure for linux is infinity?
>>>
>>
>> Unless you have calculated it, or at least measured it to a desired statistical level of accuracy, then by the definition of "worst case", it is infinite. (You might prefer to say "real time" requires calculation, not just measurement - but that gets increasingly difficult for more complex systems. If your tests suggest that missing a timing deadline is statistically less likely than being struck by lightning, that is often good enough.)
>
> Measuring is OK; calculating is not just difficult, it can be outright impractical nowadays. One should do it to get a ballpark figure of what to expect, then measure - over a long enough time the worst case response is not so hard to measure, provided you know what is going on.
>
Agreed.
>> One report I found with Google is for an 800 MHz Cortex A8 chip with kernel 2.6.31, testing with and without the "real time" patch (this is not a "real-time extension" to Linux - those work in a different way; basically the "real-time patch" sacrifices total throughput but allows most system calls and functions to be pre-emptable). Without the "real-time patch", maximum measured latencies were 2465 us - with the patch, the maximum measured latency was 58 us.
>
> Well 58 us is still OK, only about 5 times (or is it 10 times, I am not sure whether the 10 us figure was not on a 200 MHz machine) worse than DPS at a 400 MHz Power (MPC5200B). The question of why this real time patch is not universally applied remains, of course - how much of the functionality do they have to sacrifice if they use it?
In Linux, interrupts get passed on to kernel interrupt threads, and thus involve a (limited) context switch. That is always going to be more costly than handling the interrupt directly, but it allows the interrupt code more access to kernel functions.

As far as I understand the RT patch, there are two issues regarding universal application in the kernel. One is that improving worst-case response times means minimising the size of critical sections with interrupts disabled. The other is that much more of the kernel is preemptable and re-entrant, and uses finer-grained locking. So code that used to be "get lock, do A, B, C, release lock" might be changed to "get lock, do A, release lock, do B, get lock, do C, release lock". The locked (or interrupt-disabled) sections are shorter, but total throughput is reduced as there is more overhead in the locking. In particular, I gather that most spin-locks (which are very fast at taking a free lock) are replaced by mutexes with priority inheritance.

Certainly some aspects have moved into the main kernel - modern Linux kernels have much finer-grained locking than older ones, which tended to use the "big kernel lock" a great deal. The main motivation here is SMP systems - when Linux systems were generally on one core, a single "large" lock was okay, but with multiple cores it gets very inefficient. Other aspects are configurable (as are many things in Linux) - you often want a different balance between throughput and response times for server systems, desktops, and embedded systems.
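A hedged sketch of that lock-splitting transformation, using POSIX mutexes (the do_a/do_b/do_c helpers are placeholders invented for the example):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Placeholder work items for the sketch. */
static void do_a(void) {}
static void do_b(void) {}
static void do_c(void) {}

/* Throughput-friendly: one long critical section. A higher-priority
   thread wanting the lock can be kept waiting for all of A+B+C. */
void update_coarse(void)
{
    pthread_mutex_lock(&lock);
    do_a();
    do_b();
    do_c();
    pthread_mutex_unlock(&lock);
}

/* Latency-friendly: shorter critical sections, at the cost of taking
   the lock twice - more locking overhead, lower worst-case blocking.
   (Assumes do_b() does not touch the shared state.) */
void update_fine(void)
{
    pthread_mutex_lock(&lock);
    do_a();
    pthread_mutex_unlock(&lock);

    do_b();

    pthread_mutex_lock(&lock);
    do_c();
    pthread_mutex_unlock(&lock);
}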
>
> I asked for this figure only to put into its context the claim about the "need" to save all 32 registers. So let us see - saving 16 more registers to, say, the slower of the two DDRAMs, the one on the 400 MHz 5200b (assuming a complete cache miss), 133 MHz clocked DDRAM, which does something like 10 ns per .l IIRC on average, will save 160 ns from the 58 us.
Register save counts are not relevant in this context (which is why people can't understand your jump to Linux) - clearly the number of registers saved is going to be a drop in the ocean when you are talking about big cpus running big OS's, rather than microcontrollers running bare-bones or dedicated OS's (and we established long ago that the saved register count is usually, but not always, negligible in those systems too). Register save size is relevant when it is useful to have a response time of 12 cycles rather than 30 cycles - it is not an issue when the response time is 5000 cycles!
> I think we all can only laugh here. The funnier thing of course is that there is no justified necessity to waste these 160 ns - but I can understand the programmer who may have wasted them, why would he bother - it would be just a waste of his time to chase nanoseconds when the system stays masked for tens of microseconds. I would not have bothered.
>
Yes indeed - premature optimisation is the root of all evil, after all.
Reply by Dimiter_Popoff January 14, 2015
On 14.1.2015 г. 16:05, David Brown wrote:
> On 14/01/15 14:14, Dimiter_Popoff wrote:
>> On 14.1.2015 г. 14:54, David Brown wrote:
>>> On 14/01/15 13:14, Dimiter_Popoff wrote:
>>>> On 14.1.2015 г. 13:42, Tom Gardner wrote:
>>>>> On 14/01/15 02:11, Dimiter_Popoff wrote:
>>>>>> So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces. Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".
>>>>>
>>>>> Having L1/L2/L3 caches will instantly introduce a high variation between the mean and max latencies. Even for i486s with their minimal cache and no operating system, a 10:1 variability was visible.
>>>>
>>>> Yes, though on some processors one has the ability to lock part of the L1 cache - which allows to have it dedicated to interrupts which can make things a lot tighter (by saving the necessity to update entire cachelines).
>>>>
>>>> Overall the latency variability obviously increases as processor sizes increase but then total execution times decrease, memories get faster etc. so the worst case latency can still be very low. On the 5200b which I use I have never needed to resort to any cache locks etc., all I do is just stay masked only as absolutely necessary.
>>>>
>>>>> Any variability to do with register saving will be completely insignificant compared to the effects of caches. Unless, of course, you are having to dump the entire hidden state of an Itanic processor :)
>>>>>
>>>>
>>>> Well we have not come to that obvious point yet I am afraid :-). Let us first have the figure on the worst-case linux IRQ latency I asked for then put into its context the try of ARM/linux devotees about lower latency by not having enough registers :-).
>>>>
>>>> Dimiter
>>>>
>>>
>>> Neither you nor anyone else can give worst-case IRQ latencies for Linux running on PPC, MIPS, ARM, x86 or anything else - there is too much variation.
>>
>> This answer means it is infinite - nice figure in the context of saving a few registers, no doubt about that. Am I supposed to laugh or to cry.
>
> You are supposed to use something other than standard Linux (or Windows) when you need hard real time. If you really need to use Linux and you also really need real time, then you can use one of several real-time extensions to Linux which will give you a high (compared to dedicated RTOS's and more suitable hardware) but definitely not infinite maximum latency.
>
> Of course, since you sell a real-time system which /does/ have guaranteed worst-case latencies, obviously you should be laughing :-)
>
>>
>> I can give a figure for DPS - and guarantee it, commercially. As an OS DPS is meanwhile no smaller than linux - just the applications written for it are much much fewer. VM, windows, filesystem, networking etc., it is all in there.
>
> I am sure DPS has lots of useful and important features - including everything you and your customers need. But I am also sure it /is/ smaller than Linux (which is currently at about 17e6 lines for the kernel alone) - the comparison is not useful.
Oh but it is - if we compare the OS itself, not the applications. Meaning what you as a programmer will have as functionality via system calls. 17e6 lines of wasteful programming could well be less than my 1.7e6 lines (not sure about the exact figure), hard to say. Does their kernel include support for windows, offscreen buffers, graphics draw calls etc.?
> Comparing to vxworks, QNX, RTEMS, etc., would make more sense.
Do these come with all the features like windows, VM, filesystem, networking?
>> And I do have a figure for the latency. So this figure for linux is infinity?
>>
>
> Unless you have calculated it, or at least measured it to a desired statistical level of accuracy, then by the definition of "worst case", it is infinite. (You might prefer to say "real time" requires calculation, not just measurement - but that gets increasingly difficult for more complex systems. If your tests suggest that missing a timing deadline is statistically less likely than being struck by lightning, that is often good enough.)
Measuring is OK; calculating is not just difficult, it can be outright impractical nowadays. One should do it to get a ballpark figure of what to expect, then measure - over a long enough time the worst case response is not so hard to measure, provided you know what is going on.
> One report I found with Google is for an 800 MHz Cortex A8 chip with kernel 2.6.31, testing with and without the "real time" patch (this is not a "real-time extension" to Linux - those work in a different way; basically the "real-time patch" sacrifices total throughput but allows most system calls and functions to be pre-emptable). Without the "real-time patch", maximum measured latencies were 2465 us - with the patch, the maximum measured latency was 58 us.
Well 58 us is still OK, only about 5 times (or is it 10 times, I am not sure whether the 10 us figure was not on a 200 MHz machine) worse than DPS at a 400 MHz Power (MPC5200B). The question of why this real time patch is not universally applied remains, of course - how much of the functionality do they have to sacrifice if they use it?

I asked for this figure only to put into its context the claim about the "need" to save all 32 registers. So let us see - saving 16 more registers to, say, the slower of the two DDRAMs, the one on the 400 MHz 5200b (assuming a complete cache miss), 133 MHz clocked DDRAM, which does something like 10 ns per .l IIRC on average, will save 160 ns from the 58 us.

I think we all can only laugh here. The funnier thing of course is that there is no justified necessity to waste these 160 ns - but I can understand the programmer who may have wasted them, why would he bother - it would be just a waste of his time to chase nanoseconds when the system stays masked for tens of microseconds. I would not have bothered.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
Reply by David Brown January 14, 2015
On 14/01/15 14:58, Dimiter_Popoff wrote:

> Oh but this is your fixation, not mine. You argued that ARM is at an advantage because it does not have 32 registers but only 15 and put that in the linux context by talking all that ABI and whatever abbreviation gibberish the linux crowd constantly invents to mask the mess they live in.
>
Perhaps the abbreviation "ABI" has multiple uses, and you are thinking of a different one than the rest of us? In this context, it is "Application Binary Interface", and is a set of rules for code and calling convensions for a particular target system. In some cases, the ABI will vary from compiler to compiler, or between target OS's - in other cases, the cpu manufacturer will control it tightly. In the x86 world, Intel gave very little guidance on an ABI - hence x86 compilers use wildly different calling conventions. AMD did better for amd64 - almost all compilers and OS's on amd64 use AMD's ABI, but of course Microsoft picked their own incompatible (and inferior) ABI. In the PPC world, PPC EABI is the standard for embedded systems, with other ABI's used for AIX, Linux, etc. The PPC EABI (with 32-bit and 64-bit variations) covers a wide range of standardisations, including register usage, stack alignment, size of standard types, section names, standard functions, etc. It is the ABI that says register R1 is the stack pointer in the PPC, and that R2 and R13 are anchors for small data areas (constant and read/write respectively), and that registers R0 and R3-R12 are "volatile" and must be saved by interrupt wrappers that call other EABI functions. I don't know why you assumed the mention of ABI meant people were talking about Linux.
Reply by David Brown January 14, 2015
On 14/01/15 14:14, Dimiter_Popoff wrote:
> On 14.1.2015 г. 14:54, David Brown wrote:
>> On 14/01/15 13:14, Dimiter_Popoff wrote:
>>> On 14.1.2015 г. 13:42, Tom Gardner wrote:
>>>> On 14/01/15 02:11, Dimiter_Popoff wrote:
>>>>> So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces. Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".
>>>>
>>>> Having L1/L2/L3 caches will instantly introduce a high variation between the mean and max latencies. Even for i486s with their minimal cache and no operating system, a 10:1 variability was visible.
>>>
>>> Yes, though on some processors one has the ability to lock part of the L1 cache - which allows to have it dedicated to interrupts which can make things a lot tighter (by saving the necessity to update entire cachelines).
>>>
>>> Overall the latency variability obviously increases as processor sizes increase but then total execution times decrease, memories get faster etc. so the worst case latency can still be very low. On the 5200b which I use I have never needed to resort to any cache locks etc., all I do is just stay masked only as absolutely necessary.
>>>
>>>> Any variability to do with register saving will be completely insignificant compared to the effects of caches. Unless, of course, you are having to dump the entire hidden state of an Itanic processor :)
>>>>
>>>
>>> Well we have not come to that obvious point yet I am afraid :-). Let us first have the figure on the worst-case linux IRQ latency I asked for then put into its context the try of ARM/linux devotees about lower latency by not having enough registers :-).
>>>
>>> Dimiter
>>>
>>
>> Neither you nor anyone else can give worst-case IRQ latencies for Linux running on PPC, MIPS, ARM, x86 or anything else - there is too much variation.
>
> This answer means it is infinite - nice figure in the context of saving a few registers, no doubt about that. Am I supposed to laugh or to cry.
You are supposed to use something other than standard Linux (or Windows) when you need hard real time. If you really need to use Linux and you also really need real time, then you can use one of several real-time extensions to Linux which will give you a high (compared to dedicated RTOS's and more suitable hardware) but definitely not infinite maximum latency.

Of course, since you sell a real-time system which /does/ have guaranteed worst-case latencies, obviously you should be laughing :-)
>
> I can give a figure for DPS - and guarantee it, commercially. As an OS DPS is meanwhile no smaller than linux - just the applications written for it are much much fewer. VM, windows, filesystem, networking etc., it is all in there.
I am sure DPS has lots of useful and important features - including everything you and your customers need. But I am also sure it /is/ smaller than Linux (which is currently at about 17e6 lines for the kernel alone) - the comparison is not useful. Comparing to vxworks, QNX, RTEMS, etc., would make more sense. (And these folks will also give you figures for latencies - assuming you can give details of the hardware, and perhaps pay them enough money!)
> And I do have a figure for the latency.
> So this figure for linux is infinity?
>
Unless you have calculated it, or at least measured it to a desired statistical level of accuracy, then by the definition of "worst case", it is infinite. (You might prefer to say "real time" requires calculation, not just measurement - but that gets increasingly difficult for more complex systems. If your tests suggest that missing a timing deadline is statistically less likely than being struck by lightning, that is often good enough.)

One report I found with Google is for an 800 MHz Cortex A8 chip with kernel 2.6.31, testing with and without the "real time" patch (this is not a "real-time extension" to Linux - those work in a different way; basically the "real-time patch" sacrifices total throughput but allows most system calls and functions to be pre-emptable). Without the "real-time patch", maximum measured latencies were 2465 us - with the patch, the maximum measured latency was 58 us.

Measurements will only be valid on a particular system, with particular kernel versions, and typical realistic (and worst case) loads - but that 58 us will give you a ballpark figure that's a little lower than infinity.
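Figures like these are usually gathered with a tool such as cyclictest; a minimal sketch of the same idea - a periodic wake-up loop that records its worst observed lateness - looks like this (loop count and period chosen arbitrarily):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec next, now;
    long worst_ns = 0;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < 100000; i++) {
        /* Ask to be woken exactly 1 ms after the previous deadline. */
        next.tv_nsec += 1000000L;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_sec += 1;
            next.tv_nsec -= 1000000000L;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

        /* How late did we actually wake up? */
        clock_gettime(CLOCK_MONOTONIC, &now);
        long late_ns = (now.tv_sec - next.tv_sec) * 1000000000L
                     + (now.tv_nsec - next.tv_nsec);
        if (late_ns > worst_ns)
            worst_ns = late_ns;
    }
    printf("worst observed wake-up latency: %ld ns\n", worst_ns);
    return 0;
}

Run at a real-time priority (e.g. under SCHED_FIFO) on loaded and unloaded systems to get numbers comparable to the report above.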
Reply by Dimiter_Popoff January 14, 2015
On 14.1.2015 г. 15:06, Simon Clubley wrote:
> On 2015-01-13, Dimiter_Popoff <dp@tgi-sci.com> wrote:
>> On 13.1.2015 г. 22:00, Simon Clubley wrote:
>>> On 2015-01-13, Dimiter_Popoff <dp@tgi-sci.com> wrote:
>>>>
>>>> This is where the next few tons of ink will have to go apparently. What on Earth makes you think having 32 registers rather than 15 makes you have more volatile registers.
>>>
>>> Because it depends on the ABI in use.
>>
>> This is at least the third time I explain this to you but I don't mind, I'll do it as many times as it takes: there are many ways to destroy something working other than inept programming, some of them much easier.
>>
>> So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces. Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".
>>
>
> If I needed to meet guaranteed timing schedules, I wouldn't be using Linux to try and achieve them - it simply hasn't been designed for that.
So your answer is "too huge to even look up the exact figure", fairly similar to the "infinite" David gave.
>
> I don't understand your fixation on the number of registers pushed; pushing a few extra registers is a _very_ small price to pay for all the advantages of being able to write drivers and other code in a HLL.
Oh but this is your fixation, not mine. You argued that ARM is at an advantage because it does not have 32 registers but only 15 and put that in the linux context by talking all that ABI and whatever abbreviation gibberish the linux crowd constantly invents to mask the mess they live in.

My point was - still is - that ARM is a crippled load/store machine because it has too few registers to be a viable (i.e. pipelined) one. You (and a few others) wrote tons of irrelevant nonsense about saving registers, latency etc. - clearly talking without knowing what you are talking about.
> Note that even when writing HLL code to run on bare metal, the compiler still has to generate code against an ABI and hence follow the ABI's rules unless you modify the compiler to use your own custom ABI.
Oh this will be the fourth or the fifth time I have to explain this to you: there are easier ways to destroy something working than inept programming, a hammer or even a piece of rock will do as nicely.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
Reply by Dimiter_Popoff January 14, 2015
On 14.1.2015 г. 14:54, David Brown wrote:
> On 14/01/15 13:14, Dimiter_Popoff wrote:
>> On 14.1.2015 г. 13:42, Tom Gardner wrote:
>>> On 14/01/15 02:11, Dimiter_Popoff wrote:
>>>> So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces. Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".
>>>
>>> Having L1/L2/L3 caches will instantly introduce a high variation between the mean and max latencies. Even for i486s with their minimal cache and no operating system, a 10:1 variability was visible.
>>
>> Yes, though on some processors one has the ability to lock part of the L1 cache - which allows to have it dedicated to interrupts which can make things a lot tighter (by saving the necessity to update entire cachelines).
>>
>> Overall the latency variability obviously increases as processor sizes increase but then total execution times decrease, memories get faster etc. so the worst case latency can still be very low. On the 5200b which I use I have never needed to resort to any cache locks etc., all I do is just stay masked only as absolutely necessary.
>>
>>> Any variability to do with register saving will be completely insignificant compared to the effects of caches. Unless, of course, you are having to dump the entire hidden state of an Itanic processor :)
>>>
>>
>> Well we have not come to that obvious point yet I am afraid :-). Let us first have the figure on the worst-case linux IRQ latency I asked for then put into its context the try of ARM/linux devotees about lower latency by not having enough registers :-).
>>
>> Dimiter
>>
>
> Neither you nor anyone else can give worst-case IRQ latencies for Linux running on PPC, MIPS, ARM, x86 or anything else - there is too much variation.
This answer means it is infinite - nice figure in the context of saving a few registers, no doubt about that. Am I supposed to laugh or to cry.

I can give a figure for DPS - and guarantee it, commercially. As an OS DPS is meanwhile no smaller than linux - just the applications written for it are much much fewer. VM, windows, filesystem, networking etc., it is all in there.

And I do have a figure for the latency. So this figure for linux is infinity?

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/