Reply by Tauno Voipio January 18, 2015
On 18.1.15 19:19, David Brown wrote:
>>> That was needed for Thumb, but not for Thumb2 - you simply use the 32-bit instructions and have access to the same registers as you would with 32-bit ARM codes. If you like, you can think of Thumb2 as being mostly the same as 32-bit ARM (losing a little barrel shifter and conditional execution capability) with the addition of 16-bit "short-cuts" for the most commonly used instructions.
>>
>> Did they make everything orthogonal and only a matter of instruction size? I have to recheck this, have forgotten most of it already.
>
> No, it is not entirely orthogonal. In particular, common combinations will have 16-bit Thumb2 instructions, while less common combinations will have 32-bit Thumb2 instructions. For example, in the ARM, like in most RISC architectures, there is no dedicated stack register - you simply use one of the general registers along with appropriate pre- and post-increment/decrement addressing modes. But by convention, and codified in the ABI, one of the registers (r13 on the ARM, IIRC) is always used as the stack pointer. Instructions using r13 for these sorts of addressing modes will be common in the 16-bit Thumb2 encodings, but use of the same modes with other registers probably needs 32-bit Thumb2 encodings. (I haven't confirmed the details of this with the ARM documentation, but the principle is accurate.) The same thing applies to similar shortened encodings on other processors.
There is more to this: In Cortex M3 and M4, the hardware uses r13 as the stack pointer to save the processor state in exception handling. The register is banked, which lets you use separate stacks for thread code and exception (and system) code.

--
-TV
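A minimal sketch of that banking in use, assuming a CMSIS-based Cortex-M3/M4 toolchain (the stack size and names here are arbitrary placeholders): thread code is moved onto the process stack (PSP), while handlers keep using the main stack (MSP).

#include <stdint.h>
/* Assumes a CMSIS environment: your part's device header (a placeholder
   name below) pulls in core_cm3.h/core_cm4.h, which provide __set_PSP,
   __get_CONTROL, __set_CONTROL and __ISB. */
#include "device.h"

static uint32_t process_stack[256];   /* arbitrary size for this sketch */

void use_separate_thread_stack(void)
{
    /* Point PSP at the top of the dedicated thread stack (full-descending). */
    __set_PSP((uint32_t)&process_stack[256]);

    /* CONTROL bit 1 (SPSEL): thread mode now uses PSP as its r13;
       handler mode always uses MSP, so exceptions get their own stack. */
    __set_CONTROL(__get_CONTROL() | (1u << 1));
    __ISB();   /* make sure the stack switch takes effect before continuing */
}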
Reply by David Brown January 18, 2015
On 16/01/15 20:30, Vladimir Ivanov wrote:
>
> On Sat, 10 Jan 2015, David Brown wrote:
>
>> On 09/01/15 16:30, Vladimir Ivanov wrote:
>>>
>>> On Fri, 9 Jan 2015, David Brown wrote:
>>>
>>>> On 09/01/15 10:54, Vladimir Ivanov wrote:
>>>>>
>>>>> On Fri, 9 Jan 2015, David Brown wrote:
>>>>>
>>>>>> For microcontrollers, such as the Cortex M devices, I think 16 registers is a good balance for a lot of typical code.
>>>>>
>>>>> In Thumb2 you work directly with 8 GP registers, indirectly with few like PC and SP, and accessing the rest of the GPRs is different and/or has penalties.
>>>>
>>>> As far as I understand it, accessing the other registers means 32-bit instructions rather than the short 16-bit instructions. So accessing them has penalties compared to accessing the faster registers, but not compared to normal ARM 32-bit instructions.
>>>
>>> Yes, longer code sequences, and most likely very limited instruction forms. The latter leads to shuffling of data between the regular 8 GPRs and the other, "unregular" GPRs.
>>>
>>
>> That was needed for Thumb, but not for Thumb2 - you simply use the 32-bit instructions and have access to the same registers as you would with 32-bit ARM codes. If you like, you can think of Thumb2 as being mostly the same as 32-bit ARM (losing a little barrel shifter and conditional execution capability) with the addition of 16-bit "short-cuts" for the most commonly used instructions.
>
> Did they make everything orthogonal and only a matter of instruction size? I have to recheck this, have forgotten most of it already.
No, it is not entirely orthogonal. In particular, common combinations will have 16-bit Thumb2 instructions, while less common combinations will have 32-bit Thumb2 instructions. For example, in the ARM, like in most RISC architectures, there is no dedicated stack register - you simply use one of the general registers along with appropriate pre- and post-increment/decrement addressing modes. But by convention, and codified in the ABI, one of the registers (r13 on the ARM, IIRC) is always used as the stack pointer. Instructions using r13 for these sorts of addressing modes will be common in the 16-bit Thumb2 encodings, but use of the same modes with other registers probably needs 32-bit Thumb2 encodings. (I haven't confirmed the details of this with the ARM documentation, but the principle is accurate.) The same thing applies to similar shortened encodings on other processors.
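To make that concrete - a hedged sketch, with the file and function names invented for the example - compiling a small function for Thumb2 and disassembling it shows the encoding split directly:

/* demo.c - build and inspect with, e.g.:
 *   arm-none-eabi-gcc -mthumb -mcpu=cortex-m4 -Os -c demo.c
 *   arm-none-eabi-objdump -d demo.o
 * Loads, adds and compares on the low registers (r0-r7) typically come
 * out as 16-bit Thumb2 encodings; operations that need high registers
 * (r8-r12) or less common addressing forms get 32-bit encodings. */
int sum(const int *p, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += p[i];
    return total;
}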
>
>>> What I am trying to communicate, is that the CPU core with all the blocks is there. Thumb2 is more or less a decoder, just like the ARM mode is. Same with MIPS32 and MIPS16e. Why would one cripple something by removing one of the decoders? The power savings are negligible.
>>>
>>> ARM7TDMI was more balanced in that regard.
>>
>> No, the original Thumb instruction set only gave access to some of the cpu and let you write significantly slower but more compact code than full ARM. That's why they had to keep the ARM decoder too - if you needed fast code, you had to use the full instruction set. And no one considered the mix of two instruction sets to be "balance" - polite people called it a pain in the neck.
>
> :-)
>
>> Thumb2 lets you write code that is about 60% of the size of ARM code, and is often /faster/ than 32-bit ARM code, since you can get almost all of the functionality while being more efficient on your memory bandwidth and caches.
>
> Is this still valid for the big OoO/superscalar cores?
Yes. The actual balance between size ratios and speed ratios varies a little depending on the type of code being run, but I gather that ARM and Thumb2 encodings are typically within a few percent of the same speed, while Thumb2 code size is between 60% and 80% of the ARM code size. For bigger cpus, the processing speed exceeds the memory speed by a greater amount, so they are likely to gain more overall speed from Thumb2 than smaller cpus do.
>
>>>> With the original Thumb, ARM kept the normal 32-bit ARM ISA as well because for some types of code it could be significantly faster. But with Thumb2, there is almost no code for which the full 32-bit ARM instructions would beat the Thumb2, taking into account the memory bandwidth benefits of Thumb2.
>>>
>>> Any pointers to data showing this? Never heard of it so far, and does not reflect my experience.
>>>
>>> Why'd they include ARM mode at all in the Cortex-A series? :-)
>>
>> For backwards compatibility. In Cortex M applications, code is generally compiled specifically for the target - so there is no need for binary compatibility. But for Cortex A systems, you regularly have pre-compiled code from many sources, and binary compatibility with older devices is essential.
>
> That's an interesting angle. Thanks for the comments, I will investigate some more.
There is also the case that there may be types of code that will be noticeably faster in ARM encoding than Thumb2. On a Cortex-A cpu, including the ARM decoder costs a tiny percentage of die size, and it can be powered down when not in use - it is therefore cheap to add if people want to use it. For smaller devices like Cortex-M microcontrollers, the size of an ARM decoder (in addition to Thumb2) would be a much bigger percentage of the die, and therefore a bigger fraction of the cost.
Reply by Vladimir Ivanov January 16, 2015
On Sat, 10 Jan 2015, David Brown wrote:

> On 09/01/15 16:30, Vladimir Ivanov wrote:
>>
>> On Fri, 9 Jan 2015, David Brown wrote:
>>
>>> On 09/01/15 10:54, Vladimir Ivanov wrote:
>>>>
>>>> On Fri, 9 Jan 2015, David Brown wrote:
>>>>
>>>>> For microcontrollers, such as the Cortex M devices, I think 16 registers is a good balance for a lot of typical code.
>>>>
>>>> In Thumb2 you work directly with 8 GP registers, indirectly with few like PC and SP, and accessing the rest of the GPRs is different and/or has penalties.
>>>
>>> As far as I understand it, accessing the other registers means 32-bit instructions rather than the short 16-bit instructions. So accessing them has penalties compared to accessing the faster registers, but not compared to normal ARM 32-bit instructions.
>>
>> Yes, longer code sequences, and most likely very limited instruction forms. The latter leads to shuffling of data between the regular 8 GPRs and the other, "unregular" GPRs.
>>
>
> That was needed for Thumb, but not for Thumb2 - you simply use the 32-bit instructions and have access to the same registers as you would with 32-bit ARM codes. If you like, you can think of Thumb2 as being mostly the same as 32-bit ARM (losing a little barrel shifter and conditional execution capability) with the addition of 16-bit "short-cuts" for the most commonly used instructions.
Did they make everything orthogonal and only a matter of instruction size? I have to recheck this, have forgotten most of it already.
>> What I am trying to communicate, is that the CPU core with all the blocks is there. Thumb2 is more or less a decoder, just like the ARM mode is. Same with MIPS32 and MIPS16e. Why would one cripple something by removing one of the decoders? The power savings are negligible.
>>
>> ARM7TDMI was more balanced in that regard.
>
> No, the original Thumb instruction set only gave access to some of the cpu and let you write significantly slower but more compact code than full ARM. That's why they had to keep the ARM decoder too - if you needed fast code, you had to use the full instruction set. And no one considered the mix of two instruction sets to be "balance" - polite people called it a pain in the neck.
:-)
> Thumb2 lets you write code that is about 60% of the size of ARM code, and is often /faster/ than 32-bit ARM code, since you can get almost all of the functionality while being more efficient on your memory bandwidth and caches.
Is this still valid for the big OoO/superscalar cores?
>>> With the original Thumb, ARM kept the normal 32-bit ARM ISA as well because for some types of code it could be significantly faster. But with Thumb2, there is almost no code for which the full 32-bit ARM instructions would beat the Thumb2, taking into account the memory bandwidth benefits of Thumb2.
>>
>> Any pointers to data showing this? Never heard of it so far, and does not reflect my experience.
>>
>> Why'd they include ARM mode at all in the Cortex-A series? :-)
>
> For backwards compatibility. In Cortex M applications, code is generally compiled specifically for the target - so there is no need for binary compatibility. But for Cortex A systems, you regularly have pre-compiled code from many sources, and binary compatibility with older devices is essential.
That's an interesting angle. Thanks for the comments, I will investigate some more.
Reply by Vladimir Ivanov January 16, 2015
On Sat, 10 Jan 2015, Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

> Vladimir Ivanov <none@none.tld> wrote:
>> On Fri, 9 Jan 2015, Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:
>
>>> MIPS32 support is optional for cores that support microMIPS. In fact, the latest version of Microchip's XC32 compiler includes support for an unreleased PIC32MM family which only supports microMIPS.
>> Now that you mention it, I remember seeing pointers about future PIC32MM stuck to microMIPS only. Again marketing pressure?
>
> As far as I can tell from the header files and compiler source code, the PIC32MM could be a replacement/follow-up for the PIC32MX1xx/2xx. There's no DSP ASE, and no shadow registers, so it's clearly not a high-performance chip, and it doesn't seem like it has any special peripherals either. Using microMIPS at the low end makes sense, as you can fit more code in a smaller flash. I don't know how much silicon area is saved by having only the one instruction set, but that kind of makes sense for a low-end chip as well.
They will shave some Flash space from the MIPS16e -> microMIPS transition, but that won't be revolutionary. The MIPS32 -> microMIPS saving will be noticeable, yes. Maybe MIPS16e is not that popular after all.

The silicon savings of a microMIPS-only core are probably close to none; I think this is mostly for the user's comfort of staying in a single mode, and about having a clear Thumb-2 competitor. I wouldn't be surprised if the MIPS32 decoder is present in the macro cell, just fused/disabled. That is only speculation, of course, but keeping fewer cores is a sane choice. Still, the PIC32MM might be interesting.
Reply by David Brown January 14, 2015
On 14/01/15 15:41, Dimiter_Popoff wrote:
> On 14.1.2015 г. 16:05, David Brown wrote:
>> On 14/01/15 14:14, Dimiter_Popoff wrote:
>>> On 14.1.2015 г. 14:54, David Brown wrote:
>>>> On 14/01/15 13:14, Dimiter_Popoff wrote:
>>>>> On 14.1.2015 г. 13:42, Tom Gardner wrote:
>>>>>> On 14/01/15 02:11, Dimiter_Popoff wrote:
>>>>>>> So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces. Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".
>>>>>>
>>>>>> Having L1/L2/L3 caches will instantly introduce a high variation between the mean and max latencies. Even for i486s with their minimal cache and no operating system, a 10:1 variability was visible.
>>>>>
>>>>> Yes, though on some processors one has the ability to lock part of the L1 cache - which allows to have it dedicated to interrupts which can make things a lot tighter (by saving the necessity to update entire cachelines).
>>>>>
>>>>> Overall the latency variability obviously increases as processor sizes increase but then total execution times decrease, memories get faster etc. so the worst case latency can still be very low. On the 5200b which I use I have never needed to resort to any cache locks etc., all I do is just stay masked only as absolutely necessary.
>>>>>
>>>>>> Any variability to do with register saving will be completely insignificant compared to the effects of caches. Unless, of course, you are having to dump the entire hidden state of an Itanic processor :)
>>>>>>
>>>>>
>>>>> Well we have not come to that obvious point yet I am afraid :-). Let us first have the figure on the worst-case linux IRQ latency I asked for then put into its context the try of ARM/linux devotees about lower latency by not having enough registers :-).
>>>>>
>>>>> Dimiter
>>>>>
>>>>
>>>> Neither you nor anyone else can give worst-case IRQ latencies for Linux running on PPC, MIPS, ARM, x86 or anything else - there is too much variation.
>>>
>>> This answer means it is infinite - nice figure in the context of saving a few registers, no doubt about that. Am I supposed to laugh or to cry.
>>
>> You are supposed to use something other than standard Linux (or Windows) when you need hard real time. If you really need to use Linux and you also really need real time, then you can use one of several real-time extensions to Linux which will give you a high (compared to dedicated RTOS's and more suitable hardware) but definitely not infinite maximum latency.
>>
>> Of course, since you sell a real-time system which /does/ have guaranteed worst-case latencies, obviously you should be laughing :-)
>>
>>>
>>> I can give a figure for DPS - and guarantee it, commercially. As an OS DPS is meanwhile no smaller than linux - just the applications written for it are much much fewer. VM, windows, filesystem, networking etc., it is all in there.
>>
>> I am sure DPS has lots of useful and important features - including everything you and your customers need. But I am also sure it /is/ smaller than Linux (which is currently at about 17e6 lines for the kernel alone) - the comparison is not useful.
>
> Oh but it is - if we compare the OS itself, not the applications. Meaning what you as a programmer will have as functionality via system calls. 17e6 lines of wasteful programming could well be less than my 1.7e6 lines (not sure about the exact figure), hard to say. Does their kernel include support for windows, offscreen buffers, graphics draw calls etc.?
>
>> Comparing to vxworks, QNX, RTEMS, etc., would make more sense.
>
> Do these come with all the features like windows, VM, filesystem, networking?
I am not sure that this is the best place to give a beginners' course on Linux, QNX, RTEMS, or operating systems in general. Obviously you have vast experience with DPS - yet your questions show a lack of knowledge of how these sorts of OS's are built up and structured. I can't tell if you really know so little about what Linux is, and what an OS kernel is, or if you are being intentionally naïve - I have no wish to sound patronising and write about things you have worked with every day for twenty years, but equally I am happy to explain things if that is helpful to you. Can I just suggest you read the Wikipedia articles plus each project's home page, and if we need to go further then we'll take it from there?
>
>>> And I do have a figure for the latency. So this figure for linux is infinity?
>>>
>>
>> Unless you have calculated it, or at least measured it to a desired statistical level of accuracy, then by the definition of "worst case", it is infinite. (You might prefer to say "real time" requires calculation, not just measurement - but that gets increasingly difficult for more complex systems. If your tests suggest that missing a timing deadline is statistically less likely than being struck by lightning, that is often good enough.)
>
> Measuring is OK; calculating is not just difficult, it can be outright impractical nowadays. One should do it to get a ballpark figure of what to expect, then measure - over a long enough time the worst case response is not so hard to measure, provided you know what is going on.
>
Agreed.
>> One report I found with Google is for an 800 MHz Cortex A8 chip with kernel 2.6.31, testing with and without the "real time" patch (this is not a "real-time extension" to Linux - those work in a different way; basically the "real-time patch" sacrifices total throughput but allows most system calls and functions to be pre-emptable). Without the "real-time patch", maximum measured latencies were 2465 us - with the patch, the maximum measured latency was 58 us.
>
> Well 58 us is still OK, only about 5 times (or is it 10 times, I am not sure whether the 10 us figure was not on a 200 MHz machine) worse than DPS at a 400 MHz Power (MPC5200B). The question of why this real time patch is not universally applied remains, of course - how much of the functionality do they have to sacrifice if they use it?
In Linux, interrupts get passed on to kernel interrupt threads, and thus involve a (limited) context switch. That is always going to be more costly than handling the interrupt directly, but it allows the interrupt code more access to kernel functions.

As far as I understand the RT patch, there are two issues regarding universal application in the kernel. One is that improving worst-case response times means minimising the size of critical sections with interrupts disabled. The other is that much more of the kernel is preemptable and re-entrant, and uses finer-grained locking. So code that used to be "get lock, do A, B, C, release lock" might be changed to "get lock, do A, release lock, do B, get lock, do C, release lock". The locked (or interrupt-disabled) sections are shorter, but total throughput is reduced as there is more overhead in the locking. In particular, I gather that most spin-locks (which are very fast at taking a free lock) are replaced by mutexes with priority inheritance.

Certainly some aspects have moved into the main kernel - modern Linux kernels have much finer-grained locking than older ones, which tended to use the "big kernel lock" a great deal. The main motivation here is SMP systems - when Linux systems were generally on one core, a single "large" lock was okay, but with multiple cores it gets very inefficient. Other aspects are configurable (as are many things in Linux) - you often want a different balance between throughput and response times for server systems, desktops, and embedded systems.
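A hedged sketch of that lock-splitting transformation, using POSIX mutexes (the do_a/do_b/do_c helpers are placeholders invented for the example):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Placeholder work items for the sketch. */
static void do_a(void) {}
static void do_b(void) {}
static void do_c(void) {}

/* Throughput-friendly: one long critical section. A higher-priority
   thread wanting the lock can be kept waiting for all of A+B+C. */
void update_coarse(void)
{
    pthread_mutex_lock(&lock);
    do_a();
    do_b();
    do_c();
    pthread_mutex_unlock(&lock);
}

/* Latency-friendly: shorter critical sections, at the cost of taking
   the lock twice - more locking overhead, lower worst-case blocking.
   (Assumes do_b() does not touch the shared state.) */
void update_fine(void)
{
    pthread_mutex_lock(&lock);
    do_a();
    pthread_mutex_unlock(&lock);

    do_b();

    pthread_mutex_lock(&lock);
    do_c();
    pthread_mutex_unlock(&lock);
}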
>
> I asked for this figure only to put into its context the claim about the "need" to save all 32 registers. So let us see - saving 16 more registers to, say, the slower of the two DDRAMs, the one on the 400 MHz 5200b (assuming a complete cache miss), 133 MHz clocked DDRAM, which does something like 10 ns per .l IIRC on average, will save 160 ns from the 58 us.
Register save counts are not relevant in this context (which is why people can't understand your jump to Linux) - clearly the number of registers saved is going to be a drop in the ocean when you are talking about big cpus running big OS's, rather than microcontrollers running bare-bones or dedicated OS's (and we established long ago that the saved register count is usually, but not always, negligible in those systems too). Register save size is relevant when it is useful to have a response time of 12 cycles rather than 30 cycles - it is not an issue when the response time is 5000 cycles!
> I think we all can only laugh here. The funnier thing of course is that there is no justified necessity to waste these 160 ns - but I can understand the programmer who may have wasted them, why would he bother - it would be just a waste of his time to chase nanoseconds when the system stays masked for tens of microseconds. I would not have bothered.
>
Yes indeed - premature optimisation is the root of all evil, after all.
Reply by Dimiter_Popoff January 14, 2015
On 14.1.2015 г. 16:05, David Brown wrote:
> On 14/01/15 14:14, Dimiter_Popoff wrote:
>> On 14.1.2015 г. 14:54, David Brown wrote:
>>> On 14/01/15 13:14, Dimiter_Popoff wrote:
>>>> On 14.1.2015 г. 13:42, Tom Gardner wrote:
>>>>> On 14/01/15 02:11, Dimiter_Popoff wrote:
>>>>>> So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces. Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".
>>>>>
>>>>> Having L1/L2/L3 caches will instantly introduce a high variation between the mean and max latencies. Even for i486s with their minimal cache and no operating system, a 10:1 variability was visible.
>>>>
>>>> Yes, though on some processors one has the ability to lock part of the L1 cache - which allows to have it dedicated to interrupts which can make things a lot tighter (by saving the necessity to update entire cachelines).
>>>>
>>>> Overall the latency variability obviously increases as processor sizes increase but then total execution times decrease, memories get faster etc. so the worst case latency can still be very low. On the 5200b which I use I have never needed to resort to any cache locks etc., all I do is just stay masked only as absolutely necessary.
>>>>
>>>>> Any variability to do with register saving will be completely insignificant compared to the effects of caches. Unless, of course, you are having to dump the entire hidden state of an Itanic processor :)
>>>>>
>>>>
>>>> Well we have not come to that obvious point yet I am afraid :-). Let us first have the figure on the worst-case linux IRQ latency I asked for then put into its context the try of ARM/linux devotees about lower latency by not having enough registers :-).
>>>>
>>>> Dimiter
>>>>
>>>
>>> Neither you nor anyone else can give worst-case IRQ latencies for Linux running on PPC, MIPS, ARM, x86 or anything else - there is too much variation.
>>
>> This answer means it is infinite - nice figure in the context of saving a few registers, no doubt about that. Am I supposed to laugh or to cry.
>
> You are supposed to use something other than standard Linux (or Windows) when you need hard real time. If you really need to use Linux and you also really need real time, then you can use one of several real-time extensions to Linux which will give you a high (compared to dedicated RTOS's and more suitable hardware) but definitely not infinite maximum latency.
>
> Of course, since you sell a real-time system which /does/ have guaranteed worst-case latencies, obviously you should be laughing :-)
>
>>
>> I can give a figure for DPS - and guarantee it, commercially. As an OS DPS is meanwhile no smaller than linux - just the applications written for it are much much fewer. VM, windows, filesystem, networking etc., it is all in there.
>
> I am sure DPS has lots of useful and important features - including everything you and your customers need. But I am also sure it /is/ smaller than Linux (which is currently at about 17e6 lines for the kernel alone) - the comparison is not useful.
Oh but it is - if we compare the OS itself, not the applications. Meaning what you as a programmer will have as functionality via system calls. 17e6 lines of wasteful programming could well be less than my 1.7e6 lines (not sure about the exact figure), hard to say. Does their kernel include support for windows, offscreen buffers, graphics draw calls etc.?
> Comparing to vxworks, QNX, RTEMS, etc., would make more sense.
Do these come with all the features like windows, VM, filesystem, networking?
>> And I do have a figure for the latency. So this figure for linux is infinity?
>>
>
> Unless you have calculated it, or at least measured it to a desired statistical level of accuracy, then by the definition of "worst case", it is infinite. (You might prefer to say "real time" requires calculation, not just measurement - but that gets increasingly difficult for more complex systems. If your tests suggest that missing a timing deadline is statistically less likely than being struck by lightning, that is often good enough.)
Measuring is OK; calculating is not just difficult, it can be outright impractical nowadays. One should do it to get a ballpark figure of what to expect, then measure - over a long enough time the worst case response is not so hard to measure, provided you know what is going on.
> One report I found with Google is for an 800 MHz Cortex A8 chip with kernel 2.6.31, testing with and without the "real time" patch (this is not a "real-time extension" to Linux - those work in a different way; basically the "real-time patch" sacrifices total throughput but allows most system calls and functions to be pre-emptable). Without the "real-time patch", maximum measured latencies were 2465 us - with the patch, the maximum measured latency was 58 us.
Well 58 us is still OK, only about 5 times (or is it 10 times, I am not sure whether the 10 us figure was not on a 200 MHz machine) worse than DPS at a 400 MHz Power (MPC5200B). The question of why this real time patch is not universally applied remains, of course - how much of the functionality do they have to sacrifice if they use it?

I asked for this figure only to put into its context the claim about the "need" to save all 32 registers. So let us see - saving 16 more registers to, say, the slower of the two DDRAMs, the one on the 400 MHz 5200b (assuming a complete cache miss), 133 MHz clocked DDRAM, which does something like 10 ns per .l IIRC on average, will save 160 ns from the 58 us.

I think we all can only laugh here. The funnier thing of course is that there is no justified necessity to waste these 160 ns - but I can understand the programmer who may have wasted them, why would he bother - it would be just a waste of his time to chase nanoseconds when the system stays masked for tens of microseconds. I would not have bothered.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
Reply by David Brown January 14, 2015
On 14/01/15 14:58, Dimiter_Popoff wrote:

> Oh but this is your fixation, not mine. You argued that ARM is at an advantage because it does not have 32 registers but only 15 and put that in the linux context by talking all that ABI and whatever abbreviation gibberish the linux crowd constantly invents to mask the mess they live in.
>
Perhaps the abbreviation "ABI" has multiple uses, and you are thinking of a different one than the rest of us? In this context, it is "Application Binary Interface", and is a set of rules for code and calling convensions for a particular target system. In some cases, the ABI will vary from compiler to compiler, or between target OS's - in other cases, the cpu manufacturer will control it tightly. In the x86 world, Intel gave very little guidance on an ABI - hence x86 compilers use wildly different calling conventions. AMD did better for amd64 - almost all compilers and OS's on amd64 use AMD's ABI, but of course Microsoft picked their own incompatible (and inferior) ABI. In the PPC world, PPC EABI is the standard for embedded systems, with other ABI's used for AIX, Linux, etc. The PPC EABI (with 32-bit and 64-bit variations) covers a wide range of standardisations, including register usage, stack alignment, size of standard types, section names, standard functions, etc. It is the ABI that says register R1 is the stack pointer in the PPC, and that R2 and R13 are anchors for small data areas (constant and read/write respectively), and that registers R0 and R3-R12 are "volatile" and must be saved by interrupt wrappers that call other EABI functions. I don't know why you assumed the mention of ABI meant people were talking about Linux.
Reply by David Brown January 14, 2015
On 14/01/15 14:14, Dimiter_Popoff wrote:
> On 14.1.2015 г. 14:54, David Brown wrote:
>> On 14/01/15 13:14, Dimiter_Popoff wrote:
>>> On 14.1.2015 г. 13:42, Tom Gardner wrote:
>>>> On 14/01/15 02:11, Dimiter_Popoff wrote:
>>>>> So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces. Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".
>>>>
>>>> Having L1/L2/L3 caches will instantly introduce a high variation between the mean and max latencies. Even for i486s with their minimal cache and no operating system, a 10:1 variability was visible.
>>>
>>> Yes, though on some processors one has the ability to lock part of the L1 cache - which allows to have it dedicated to interrupts which can make things a lot tighter (by saving the necessity to update entire cachelines).
>>>
>>> Overall the latency variability obviously increases as processor sizes increase but then total execution times decrease, memories get faster etc. so the worst case latency can still be very low. On the 5200b which I use I have never needed to resort to any cache locks etc., all I do is just stay masked only as absolutely necessary.
>>>
>>>> Any variability to do with register saving will be completely insignificant compared to the effects of caches. Unless, of course, you are having to dump the entire hidden state of an Itanic processor :)
>>>>
>>>
>>> Well we have not come to that obvious point yet I am afraid :-). Let us first have the figure on the worst-case linux IRQ latency I asked for then put into its context the try of ARM/linux devotees about lower latency by not having enough registers :-).
>>>
>>> Dimiter
>>>
>>
>> Neither you nor anyone else can give worst-case IRQ latencies for Linux running on PPC, MIPS, ARM, x86 or anything else - there is too much variation.
>
> This answer means it is infinite - nice figure in the context of saving a few registers, no doubt about that. Am I supposed to laugh or to cry.
You are supposed to use something other than standard Linux (or Windows) when you need hard real time. If you really need to use Linux and you also really need real time, then you can use one of several real-time extensions to Linux which will give you a high (compared to dedicated RTOS's and more suitable hardware) but definitely not infinite maximum latency.

Of course, since you sell a real-time system which /does/ have guaranteed worst-case latencies, obviously you should be laughing :-)
>
> I can give a figure for DPS - and guarantee it, commercially. As an OS DPS is meanwhile no smaller than linux - just the applications written for it are much much fewer. VM, windows, filesystem, networking etc., it is all in there.
I am sure DPS has lots of useful and important features - including everything you and your customers need. But I am also sure it /is/ smaller than Linux (which is currently at about 17e6 lines for the kernel alone) - the comparison is not useful. Comparing to vxworks, QNX, RTEMS, etc., would make more sense. (And these folks will also give you figures for latencies - assuming you can give details of the hardware, and perhaps pay them enough money!)
> And I do have a figure for the latency.
> So this figure for linux is infinity?
>
Unless you have calculated it, or at least measured it to a desired statistical level of accuracy, then by the definition of "worst case", it is infinite. (You might prefer to say "real time" requires calculation, not just measurement - but that gets increasingly difficult for more complex systems. If your tests suggest that missing a timing deadline is statistically less likely than being struck by lightning, that is often good enough.)

One report I found with Google is for an 800 MHz Cortex A8 chip with kernel 2.6.31, testing with and without the "real time" patch (this is not a "real-time extension" to Linux - those work in a different way; basically the "real-time patch" sacrifices total throughput but allows most system calls and functions to be pre-emptable). Without the "real-time patch", maximum measured latencies were 2465 us - with the patch, the maximum measured latency was 58 us.

Measurements will only be valid on a particular system, with particular kernel versions, and typical realistic (and worst case) loads - but that 58 us will give you a ballpark figure that's a little lower than infinity.
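Figures like these are usually gathered with a tool such as cyclictest; a minimal sketch of the same idea - a periodic wake-up loop that records its worst observed lateness - looks like this (loop count and period chosen arbitrarily):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec next, now;
    long worst_ns = 0;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < 100000; i++) {
        /* Ask to be woken exactly 1 ms after the previous deadline. */
        next.tv_nsec += 1000000L;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_sec += 1;
            next.tv_nsec -= 1000000000L;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

        /* How late did we actually wake up? */
        clock_gettime(CLOCK_MONOTONIC, &now);
        long late_ns = (now.tv_sec - next.tv_sec) * 1000000000L
                     + (now.tv_nsec - next.tv_nsec);
        if (late_ns > worst_ns)
            worst_ns = late_ns;
    }
    printf("worst observed wake-up latency: %ld ns\n", worst_ns);
    return 0;
}

Run at a real-time priority (e.g. under SCHED_FIFO) on loaded and unloaded systems to get numbers comparable to the report above.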
Reply by Dimiter_Popoff January 14, 2015
On 14.1.2015 г. 15:06, Simon Clubley wrote:
> On 2015-01-13, Dimiter_Popoff <dp@tgi-sci.com> wrote:
>> On 13.1.2015 г. 22:00, Simon Clubley wrote:
>>> On 2015-01-13, Dimiter_Popoff <dp@tgi-sci.com> wrote:
>>>>
>>>> This is where the next few tons of ink will have to go apparently. What on Earth makes you think having 32 registers rather than 15 makes you have more volatile registers.
>>>
>>> Because it depends on the ABI in use.
>>
>> This is at least the third time I explain this to you but I don't mind, I'll do it as many times as it takes: there are many ways to destroy something working other than inept programming, some of them much easier.
>>
>> So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces. Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".
>>
>
> If I needed to meet guaranteed timing schedules, I wouldn't be using Linux to try and achieve them - it simply hasn't been designed for that.
So your answer is "too huge to even look up the exact figure", fairly similar to the "infinite" David gave.
>
> I don't understand your fixation on the number of registers pushed; pushing a few extra registers is a _very_ small price to pay for all the advantages of being able to write drivers and other code in a HLL.
Oh but this is your fixation, not mine. You argued that ARM is at an advantage because it does not have 32 registers but only 15 and put that in the linux context by talking all that ABI and whatever abbreviation gibberish the linux crowd constantly invents to mask the mess they live in.

My point was - still is - that ARM is a crippled load/store machine because it has too few registers to be a viable (i.e. pipelined) one. You (and a few others) wrote tons of irrelevant nonsense about saving registers, latency etc. - clearly talking without knowing what you are talking about.
> Note that even when writing HLL code to run on bare metal, the compiler still has to generate code against an ABI and hence follow the ABI's rules unless you modify the compiler to use your own custom ABI.
Oh this will be the fourth or the fifth time I have to explain this to you: there are easier ways to destroy something working than inept programming, a hammer or even a piece of rock will do as nicely.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
Reply by Dimiter_Popoff January 14, 2015
On 14.1.2015 г. 14:54, David Brown wrote:
> On 14/01/15 13:14, Dimiter_Popoff wrote:
>> On 14.1.2015 г. 13:42, Tom Gardner wrote:
>>> On 14/01/15 02:11, Dimiter_Popoff wrote:
>>>> So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces. Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".
>>>
>>> Having L1/L2/L3 caches will instantly introduce a high variation between the mean and max latencies. Even for i486s with their minimal cache and no operating system, a 10:1 variability was visible.
>>
>> Yes, though on some processors one has the ability to lock part of the L1 cache - which allows to have it dedicated to interrupts which can make things a lot tighter (by saving the necessity to update entire cachelines).
>>
>> Overall the latency variability obviously increases as processor sizes increase but then total execution times decrease, memories get faster etc. so the worst case latency can still be very low. On the 5200b which I use I have never needed to resort to any cache locks etc., all I do is just stay masked only as absolutely necessary.
>>
>>> Any variability to do with register saving will be completely insignificant compared to the effects of caches. Unless, of course, you are having to dump the entire hidden state of an Itanic processor :)
>>>
>>
>> Well we have not come to that obvious point yet I am afraid :-). Let us first have the figure on the worst-case linux IRQ latency I asked for then put into its context the try of ARM/linux devotees about lower latency by not having enough registers :-).
>>
>> Dimiter
>>
>
> Neither you nor anyone else can give worst-case IRQ latencies for Linux running on PPC, MIPS, ARM, x86 or anything else - there is too much variation.
This answer means it is infinite - nice figure in the context of saving a few registers, no doubt about that. Am I supposed to laugh or to cry.

I can give a figure for DPS - and guarantee it, commercially. As an OS DPS is meanwhile no smaller than linux - just the applications written for it are much much fewer. VM, windows, filesystem, networking etc., it is all in there.

And I do have a figure for the latency. So this figure for linux is infinity?

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/