
Integrated TFT controller in PIC MCUs

Started by pozz January 7, 2015
glen herrmannsfeldt wrote on 10-Jan-15 at 3:25 AM:
> Simon Clubley <clubley@remove_me.eisner.decus.org-earth.ufp> wrote:
>> On 2015-01-09, Dimiter_Popoff <dp@tgi-sci.com> wrote:
>
>>> Well I really cannot simplify the concept of saving say 4 out of 32
>>> registers, using only them in an IRQ handler, then restoring only
>>> them and returning from the exception.
>
>> In the general case, you have to push and pop all the registers every
>> time you take an interrupt.
>
> Some processors have multiple register sets that might avoid that.
> SPARC has register windows, such that they don't have to save to
> memory until all the windows are in use. I don't know if that is
> for interrupts, too.
That is a nice concept, but it has two problems:

- when you overflow the available register sets, you must spill to memory. Whether you need to do this depends on where you are in the register set. This makes timing difficult to predict, which is not nice for a real-time system.

- imagine a context switch: now you have to save/restore all register sets!

In general, fat-context CPUs are better at single-threaded no-interrupt applications, but worse for switch-often interrupt-heavy applications.

Wouter van Ooijen
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote on 10-Jan-15 at 10:51 AM:
> Vladimir Ivanov <none@none.tld> wrote:
>
>> Do all instruction forms have wider version to accommodate R8-R15 usage?
>
> I believe that is the case.
Not in Cortex-M0. About the only thing you can do with the upper registers (r8-12) is copy to/from a lower register.

As an illustration: a cooperative context switch on a cortex-m0:

        .cpu cortex-m0
        .global switch_from_to
        .text
        .align 2

// extern "C" void switch_from_to(
//    int ** current_stack_pointer,
//    int * next_stack_pointer
// );

switch_from_to:

   // save current context on the stack
   push { r4 - r7, lr }
   mov r2, r8
   mov r3, r9
   mov r4, r10
   mov r5, r11
   mov r6, r12
   push { r2 - r6 }

   // *current_stack_pointer = sp
   mov r2, sp
   str r2, [ r0 ]

   // sp = next_stack_pointer
   mov sp, r1

   // restore the new context from the new stack
   pop { r2 - r6 }
   mov r12, r6
   mov r11, r5
   mov r10, r4
   mov r9, r3
   mov r8, r2
   pop { r4 - r7, pc }

Wouter van Ooijen
Wouter van Ooijen <wouter@voti.nl> wrote:
> Anders.Montonen@kapsi.spam.stop.fi.invalid wrote on 10-Jan-15 at 10:51 AM:
>> Vladimir Ivanov <none@none.tld> wrote:
>>
>>> Do all instruction forms have wider version to accommodate R8-R15 usage?
>> I believe that is the case.
> Not in Cortex-M0.
That is true, but they implement ARMv6-M Thumb, not Thumb-2. -a
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote on 10-Jan-15 at 11:45 AM:
> Wouter van Ooijen <wouter@voti.nl> wrote:
>> Anders.Montonen@kapsi.spam.stop.fi.invalid wrote on 10-Jan-15 at 10:51 AM:
>>> Vladimir Ivanov <none@none.tld> wrote:
>>>
>>>> Do all instruction forms have wider version to accommodate R8-R15 usage?
>>> I believe that is the case.
>> Not in Cortex-M0.
>
> That is true, but they implement ARMv6-M Thumb, not Thumb-2.
In that case I was confused about the context of the statement :) Wouter
On 09/01/15 23:04, Dimiter_Popoff wrote:
> On 09.1.2015 г. 11:47, David Brown wrote:
>> On 09/01/15 00:22, Dimiter_Popoff wrote:
>>> On 09.1.2015 г. 00:53, Wouter van Ooijen wrote:
>>>> Dimiter_Popoff wrote on 08-Jan-15 at 11:18 PM:
>>>>> On 08.1.2015 г. 23:25, David Brown wrote:
>>>>>> ... (Just as "cpus should have more than 16 core registers" is a
>>>>>> good reason for disliking ARM's, if that is your opinion.)
>>>>>
>>>>> Results of arithmetic calculations are not exactly what I would
>>>>> call an opinion.
>>>>> 16 registers - one of which being reserved for the PC - are too few
>>>>> for a load/store machine.
>>>>
>>>> 16 is a fact, but the rest is nothing but opinion.
>>>>
>>>>> Clearly it will work but under equal conditions
>>>>> will be slower than if it had 32 registers, sometimes much slower.
>>>>
>>>> More registers can be slower too.
>>>
>>> Yes, 32 is about the optimum I suppose. But I have not really analyzed
>>> that, what I have - and demonstrated by an example which you chose to
>>> ignore - is the comparison 32 vs. 16.
>>
>> Certainly it is possible to pick examples where 32 registers is more
>> effective than 16 - but equally we can pick examples where 16 registers
>> is more efficient (such as context switching, or code with a lot of
>> small functions). Examples are illustrative, but not proof of a general
>> rule.
>
> You have yet to prove this point. Context switching is not a valid
> example, as I explained in my former post which you must have read
> prior to replying to (for those who have not, context switching is
> responsible for a fraction of a percent of CPU time, consequently
> halving or even completely eliminating that can bring a fraction of
> a percent improvement, i.e. it is negligible).
I think it would be wrong to talk about /proving/ points here - proper proof would require implementing the same algorithms on different architectures (or preferably on the same basic architecture, but with different register counts) and comparing code densities, run times, cache hit/miss counts, memory bandwidth, etc. That is clearly far beyond what anyone will do for a Usenet post!

Context switching is not a valid example for /you/, based on the figures /you/ gave. But it is certainly not something that can be dismissed so easily. I have written systems that had timer interrupts 100,000 times per second - thus interrupt overhead is 100 times as important as in your single example with 1000 interrupts per second. I have written one system where there were 40 clock cycles between interrupts - that does not leave a lot of time for context saving and restoring.

In general, in bigger systems (and PowerPC cores tend to be used in bigger systems than most of the embedded cores we see here) you try to avoid many interrupts, and prefer DMA and more sophisticated peripherals to keep the interrupt rate low. In smaller microcontrollers, you don't have such sophistication - your UART might have only a single buffer, so you will have interrupts for every character transmitted or received. But interrupts are less of an overhead, partly because of the lower register count. It's a different balance.

Other than interrupt context switches, there is also function call overhead. I don't know what sort of calling conventions you use in your VPA, but for high-level languages there is normally a defined convention with some registers being caller-saved (or "volatile") and others being callee-saved (or "non-volatile"). The split can vary between compilers on a target, or can be defined by the target's standard ABI. When you are calling unknown code (i.e., code from a separately compiled module), the compiler (or assembly programmer) must follow the calling convention. That means a function must save any callee-saved registers before using them, in case the calling function had data there. And it must save caller-saved registers before function calls, in case the called function uses them. There is always a significant amount of unnecessary saving and restoring in this process, and it increases with the number of registers in the system and with the number of small and simple functions (because with larger functions that really use the data, the save/restores are no longer unnecessary and therefore not overhead).

So if a PPC function (following the standard PPC EABI conventions) needs to use any of the 18 "non-volatile" registers, it must save the old values and restore them on exit - even if the calling function does not need the old values. And if it is calling another function, then it must save any of the 11 "volatile" registers it wants to keep - even if the called function does not touch them.
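For illustration, a minimal C sketch of that caller-saved/callee-saved split (not from the thread; extern_work() is a made-up stand-in for separately compiled code whose body the compiler cannot see):

/* A rough sketch of the calling-convention overhead described above. */

extern int extern_work(int x);

int small_wrapper(int a, int b)
{
    int t = a * 3 + b;      /* t is live across the call below */

    /* Because extern_work() may clobber every caller-saved ("volatile")
       register, the compiler must either keep t in a callee-saved
       ("non-volatile") register - paying for a save/restore of that
       register in this function's prologue/epilogue - or spill t to the
       stack around the call, even if extern_work() never touches it. */
    int r = extern_work(t);

    return r + t;
}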
> I would certainly be interested in the example you claim to be able
> to produce demonstrating how 16 registers can be more effective.
> My example covers a typical, widespread application - that of a FIR.
> Let us see yours.
I can't produce a /function/ that is more efficient with 16 registers than 32 registers - but I hope that above I have explained how it makes a difference with chains of small functions (or functions with few register demands).

I think it is reasonable to say that as programs (in the embedded world) have got bigger, memories have got bigger, and compilers have got better, then the balance for many systems has moved more towards 32 registers rather than 16 registers, in the same way that it has moved towards 32-bit cpus from 8 and 16-bit cpus.
>>>> More registers => more bits in an instruction to specify a register =>
>>>> more to read from code memory => slower
>>>
>>> Not really. Load/store machines typically have a fixed 32 bit
>>> instruction word; using that for only 16 registers is simply
>>> waste of space, you will have less information packed in the
>>> opcode thus you will need more opcodes to do the same job, thus
>>> having to do *more* memory fetches.
>>
>> Many load/store machines have some sort of compressed or limited
>> instruction format for greater efficiency (especially of instruction
>> cache). ARM has a couple of "thumb" modes, MIPS have an equivalent, and
>> on the PPC I have seen various schemes.
>
> I'd be interested to see those sub-32 bit opcode schemes you talk about
> on power, I have yet to encounter one of them. Certainly none of them
> is present on the cores I use or have investigated. (I am just curious,
> not in need of something like that).
The one I am most familiar with is VLE, which is used by many of Freescale's PPC microcontrollers. (It may be used by others too, but I have only used PPC from Freescale.) I also remember that Freescale had some chips with a sort of compression scheme where code was decompressed while it was loaded into cache - but I have forgotten the details. VLE is the same basic idea as ARM's Thumb2 and the MIPS equivalent.
>> Even in the full 32-bit ARM set, the bits "saved" by having fewer
>> registers are used for the conditional execution bits and the barrel
>> shifter.
>
> Can you please elaborate on that. What can ARM do using the barrel
> shifter which you cannot do using the likes of rlwinm, rlwnm or rlwimi
> on power?
The barrel shifter bits apply to a wide range of ARM instructions - meaning you can effectively tag a zero-cost shift instruction into many other instructions. Thus if you want "a = b + c * 16;", on the PPC you need two instructions (shift then add) with pipeline/scheduling considerations between them, while on the ARM it is done in one instruction and one cycle. With 16-bit Thumb instructions, the barrel shifter only works on loads, stores, and specific rotate/shift instructions - with 32-bit Thumb and full ARM instructions it works on a wide range of instructions.
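As a small illustration of that "a = b + c * 16;" example - the instructions in the comments are what a compiler typically emits for such code, shown for illustration rather than captured from any particular compiler:

unsigned add_scaled(unsigned b, unsigned c)
{
    /* ARM (32-bit ARM or 32-bit Thumb-2): one instruction, the shift is
       folded into the second operand:
           add   r0, r0, r1, lsl #4
       PowerPC: two dependent instructions:
           slwi  r4, r4, 4        (a simplified mnemonic for an rlwinm form)
           add   r3, r3, r4                                                 */
    return b + c * 16u;
}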
>> Every bit of space in the instruction set is important. Using them to
>> support extra registers may be the best overall tradeoff in some cases,
>> but it is always a tradeoff.
>
> This kind of general talk leads nowhere here as you may have found out
> in previous discussions. If you cannot support a claim by a valid
> particular example you are saying nothing.
>
> I explained that 16 registers are too few for a load/store machine,
> gave an example to make it easier to understand why (to be able to
> compensate for the pipeline delay).
>
> You are just repeating your opinion basing it on nothing.
And you in turn are generalising based on an example and your own very specialised experience. I hope that what I wrote earlier makes it clear why I think 16 registers can be an advantage, and why I think your example cannot count as a general proof or argument (though I happily accept it as an example of when 32 registers is very useful).

Beyond that, I would say that even if 16 registers is not /more/ efficient than 32 registers, I have yet to see any reasoning for why 32 registers is /significantly/ more useful for the type of code generally seen in microcontrollers (and yes, I have no choice here but to talk of generalities). A filter algorithm might run faster with more registers - but it would do even better by using additional DSP-specific support (such as the DSP registers and instructions on the Cortex M4 compared to the M3) or SIMD support (AltiVec on PPC, Neon on ARM).

I will try not to repeat myself, but I hope you can see that this is necessarily a generalised discussion, based strongly on opinion and personal experience - unless you want to give several whole programs implemented on at least two architectures using commonly used tools and techniques (i.e., C code rather than VPA code) as real evidence.
>>> If you become familiar with the power architecture instruction set
>>> you will find very little room for performance improvement, the
>>> person who did it knew what he was doing really well.
>>
>> It's a nice architecture (apart from the backwards bit numbering!).
>
> The bit numbering is simply big endian. I also
> have had my trouble with it of course but it is easy to get used to.
Yes, it is fine when you are used to it - but very weird to start with, and different from everything else (including the numbering used on other big-endian processors I have used, such as M68K/ColdFire). You have to double-check everything when you are connecting address line A31 on the MPC to A0 on the external RAM chip! And then Freescale's documentation freely mixes up 64-bit PPC conventions with the 32-bit conventions, so you find you are trying to set "bit 59" of a register by writing the value 16... It was even more "fun" when I found Freescale had also mixed it up in a couple of register definitions in a header file.
> In VPA I do use both - crazy as it may seem, once you get used to
> thinking about it, it is no longer an issue.
> But overall having big endian bit numbering on a big endian machine
> is the correct thing to do. If one chooses to use a power core in
> little endian mode much of the time one will be somewhat screwed
> I suppose.
To me, "bit 0" is always the least significant bit regardless of the endianness of the chip. But I try not to start wars over it :-)
>> ... But
>> it is a very complex architecture - for smaller devices (say 200 MHz,
>> single core - microcontroller class cpus) a PPC core will be much
>> bigger, more difficult to design and work with, and take more power than
>> an ARM (or MIPS) core.
>
> Oh I agree power does not make sense on the smallest of MCUs, of course.
> In fact I don't think it makes much sense below a megabyte of RAM
> or so (but then that's me and is based on my needs so far, I am not
> claiming this to be some general rule).
Fair enough.
>>> Not at all. You can always save/restore even just 1 register if you
>>> want to on a 32 register machine, what you *cannot* do on ARM is
>>> have more than 13 registers to use in an IRQ handler when you need
>>> them - so you will need *more* memory accesses in this case.
>>
>> That makes no sense to me.
>
> Well I really cannot simplify the concept of saving say 4 out of 32
> registers, using only them in an IRQ handler, then restoring only
> them and returning from the exception.
When your interrupt function is a leaf function, it's no problem to only save the registers you need (regardless of the size of the register set). But when it is not a leaf function, and you are calling other functions (whose code you do not know), you have to follow the calling conventions and assume that the called function will destroy all "volatile" registers - that's at least 11 extra register saves in a PPC EABI system, as well as the link register, CCR, etc.
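A small C sketch of that leaf/non-leaf distinction (the handler names are invented; the register list in the comment refers to the PPC EABI convention discussed here):

extern void external_callback(void);   /* separately compiled, unknown code */

volatile unsigned tick_count;

void leaf_handler(void)
{
    /* Leaf: the compiler saves only the few registers it actually uses. */
    tick_count++;
}

void chained_handler(void)
{
    tick_count++;
    /* Non-leaf: external_callback() is allowed to clobber every
       caller-saved ("volatile") register - roughly r0 and r3-r12, plus
       LR, CTR, CR and XER on PPC EABI - so the interrupt entry path has
       to have saved all of them before this call is made. */
    external_callback();
}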
>>> And since the latency which matters is the IRQ latency (tasks get
>>> switched once in milliseconds, IRQ latencies may well have to be
>>> in the few uS range) - the time on save/restore all registers
>>> is negligible (e.g. 32 2.5nS cycles once per mS on a 400 MHz power
>>> core, or 0.008% of the time).
>>
>> The time needed to save and restore all registers may not be relevant in
>> a given application, but it is not irrelevant or negligible in all cases.
>
> So show us one such case.
On one PPC system I worked with which did not support individual interrupt vectors, /all/ registers were saved (and later restored) because the vector code was calling unknown external code. 32 x 32-bit general purpose registers plus 32 x 64-bit floating point registers, stored on external SRAM with 2-cycle accesses, meant about 400 cycles just for the register save and restore - not including the memory bandwidth for reading the code or any other processing time.

Even if I had spent time optimising it by limiting the storage to the volatile registers (since I knew the interrupt functions all followed the EABI conventions correctly), and therefore halved the overhead, it would still have been very large.
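A back-of-the-envelope version of that figure, using the assumptions stated above plus one assumption of my own - a 32-bit path to the external SRAM, so each 64-bit FPR takes two accesses:

enum {
    GPR_ACCESSES      = 32 * 1,  /* 32 x 32-bit general purpose registers   */
    FPR_ACCESSES      = 32 * 2,  /* 32 x 64-bit FPRs, two bus accesses each */
    CYCLES_PER_ACCESS = 2,       /* external SRAM, as stated above          */

    SAVE_CYCLES  = (GPR_ACCESSES + FPR_ACCESSES) * CYCLES_PER_ACCESS, /* 192 */
    TOTAL_CYCLES = 2 * SAVE_CYCLES   /* 384 - roughly the "400 cycles" quoted */
};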
On 09/01/15 16:30, Vladimir Ivanov wrote:
> On Fri, 9 Jan 2015, David Brown wrote:
>
>> On 09/01/15 10:54, Vladimir Ivanov wrote:
>>>
>>> On Fri, 9 Jan 2015, David Brown wrote:
>>>
>>>> For microcontrollers, such as the Cortex M devices, I think 16
>>>> registers is a good balance for a lot of typical code.
>>>
>>> In Thumb2 you work directly with 8 GP registers, indirectly with few
>>> like PC and SP, and accessing the rest of the GPRs is different and/or
>>> has penalties.
>>
>> As far as I understand it, accessing the other registers means 32-bit
>> instructions rather than the short 16-bit instructions. So accessing
>> them has penalties compared to accessing the faster registers, but not
>> compared to normal ARM 32-bit instructions.
>
> Yes, longer code sequences, and most likely very limited instruction
> forms. The latter leads to shuffling of data between the regular 8 GPRs
> and the other, "unregular" GPRs.
That was needed for Thumb, but not for Thumb2 - you simply use the 32-bit instructions and have access to the same registers as you would with 32-bit ARM code. If you like, you can think of Thumb2 as being mostly the same as 32-bit ARM (losing a little of the barrel shifter and conditional execution capability) with the addition of 16-bit "short-cuts" for the most commonly used instructions.
>>> Just trying to say that it is a moot point. And personally, I never
>>> understood the existence of Cortex-M - why cripple the ability to switch
>>> to native 32-bit mode, if most or all of the underlying logic is there?
>>
>> My knowledge of the details is weak, but AFAIK the only thing you really
>> lose with Thumb2 compared to ARM instruction sets is the conditional
>> execution flags - with ARM, you can use the flags with most
>> instructions, while with Thumb2 you have the if-then-else construction.
>> (You also lose the barrel shifter on some instructions, but that is not
>> going to affect much code.)
>
> I am not a Thumb2 expert, either. As a very strong personal (biased)
> opinion, I don't find it elegant at all. MIPS16e impressed me a bit more
> with their EXTEND instruction.
I haven't studied either Thumb2 or MIPS16e (or PPC VLE) in detail, but they all seem to be a similar solution to a similar problem - making a variable-length encoding scheme that is easy to decode, keeps common instructions short, but makes it easy to access the full range of the cpu's abilities.
> What I am trying to communicate, is that the CPU core with all the
> blocks is there. Thumb2 is more or less a decoder, just like the ARM
> mode is. Same with MIPS32 and MIPS16e. Why would one cripple something
> by removing one of the decoders? The power savings are negligible.
>
> ARM7TDMI was more balanced in that regard.
No, the original Thumb instruction set only gave access to some of the cpu and let you write significantly slower but more compact code than full ARM. That's why they had to keep the ARM decoder too - if you needed fast code, you had to use the full instruction set. And no one considered the mix of two instruction sets to be "balance" - polite people called it a pain in the neck.

Thumb2 lets you write code that is about 60% of the size of ARM code, and is often /faster/ than 32-bit ARM code, since you can get almost all of the functionality while being more efficient on your memory bandwidth and caches.
>> With the original Thumb, ARM kept the normal 32-bit ARM ISA as well
>> because for some types of code it could be significantly faster. But
>> with Thumb2, there is almost no code for which the full 32-bit ARM
>> instructions would beat the Thumb2, taking into account the memory
>> bandwidth benefits of Thumb2.
>
> Any pointers to data showing this? Never heard of it so far, and does
> not reflect my experience.
>
> Why'd they include ARM mode at all in the Cortex-A series? :-)
For backwards compatibility. In Cortex M applications, code is generally compiled specifically for the target - so there is no need for binary compatibility. But for Cortex A systems, you regularly have pre-compiled code from many sources, and binary compatibility with older devices is essential.
On 10.1.2015 г. 20:07, David Brown wrote:
> On 09/01/15 23:04, Dimiter_Popoff wrote:
>> On 09.1.2015 г. 11:47, David Brown wrote:
>>> On 09/01/15 00:22, Dimiter_Popoff wrote:
>>>> On 09.1.2015 г. 00:53, Wouter van Ooijen wrote:
>>>>> Dimiter_Popoff wrote on 08-Jan-15 at 11:18 PM:
>>>>>> On 08.1.2015 г. 23:25, David Brown wrote:
>>>>>>> ... (Just as "cpus should have more than 16 core registers" is a
>>>>>>> good reason for disliking ARM's, if that is your opinion.)
>>>>>>
>>>>>> Results of arithmetic calculations are not exactly what I would
>>>>>> call an opinion.
>>>>>> 16 registers - one of which being reserved for the PC - are too few
>>>>>> for a load/store machine.
>>>>>
>>>>> 16 is a fact, but the rest is nothing but opinion.
>>>>>
>>>>>> Clearly it will work but under equal conditions
>>>>>> will be slower than if it had 32 registers, sometimes much slower.
>>>>>
>>>>> More registers can be slower too.
>>>>
>>>> Yes, 32 is about the optimum I suppose. But I have not really analyzed
>>>> that, what I have - and demonstrated by an example which you chose to
>>>> ignore - is the comparison 32 vs. 16.
>>>
>>> Certainly it is possible to pick examples where 32 registers is more
>>> effective than 16 - but equally we can pick examples where 16 registers
>>> is more efficient (such as context switching, or code with a lot of
>>> small functions). Examples are illustrative, but not proof of a general
>>> rule.
>>
>> You have yet to prove this point. Context switching is not a valid
>> example, as I explained in my former post which you must have read
>> prior to replying to (for those who have not, context switching is
>> responsible for a fraction of a percent of CPU time, consequently
>> halving or even completely eliminating that can bring a fraction of
>> a percent improvement, i.e. it is negligible).
>
> I think it would be wrong to talk about /proving/ points here
> ...

OK, I really meant "make" your point. Though in technical terms making a point which cannot be proven is fairly pointless...
> Context switching is not a valid example for /you/, based on the figures
> /you/ gave.
So you understand that these figures are correct - but imply that the example applies just to me. I thought we could agree at least on the meaning of numbers.
> I have written systems that had timer interrupts at 100,000
> times per second
Which has *nothing* to do with context switching. If you do that and save all registers instead of the minimum you have to, you just don't know what you are doing.

I know from threads from years past that you tend to mix up task scheduling and interrupt processing, but please understand that there is a world of difference between interrupt processing and a task switch initiated by an interrupt.

You have not made a point - context switching is *not* a case where 32 registers can be worse off than 16 in a non-negligible way (negligible meaning a performance cost within, say, 0.1%, and latency-wise the same as 16 or better). You have yet to give a valid example for what you claim.
> In general, in bigger systems (and PowerPC cores tend to be used in
> bigger systems than most of the embedded cores we see here) you try to
> avoid many interrupts, and prefer DMA and more sophisticated peripherals
> to keep the interrupt rate low.
This is wrong, it is not true that on larger systems interrupts must be avoided. You should understand that there is no such animal as "in general" in engineering. Things we make have to *work*, so we have to go down to the details.

For example, on an mpc5200b based system - the one for my everyday use for programming, emailing etc., running DPS of course - I have plenty of interrupts all the time: from the display controller vertical retrace, from the two PS2 ports where the mouse and the keyboard are connected - each PS2 clock causes an interrupt, and yes, they can come at a faster rate than every 10uS - then there are the ATA interface interrupts etc. etc.

These have *nothing* to do with task switching, neither do they initiate one. Say the decrementer interrupt might initiate one, and then of course all registers will be shuffled - but *AFTER* the interrupt has been served and unmasked again, so that the interrupt latency stays low (so a 10uS IRQ rate cannot impress the machine).

Whatever example you try to come up with, you will never find one which requires saving *all* of the registers with the interrupts masked - which means there is no performance advantage in having 16 rather than 32 registers, while the opposite is true most if not all of the time.
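For readers following along, a generic C sketch of the pattern being described - pend the reschedule in the interrupt handler, and do the expensive register shuffle only after the interrupt has been served and unmasked. This is not DPS code; all names are invented for illustration:

#include <stdbool.h>

extern void acknowledge_timer(void);    /* hypothetical: clear the IRQ source   */
extern void switch_to_next_task(void);  /* hypothetical: full context save/load */

static volatile bool reschedule_pending;

void decrementer_handler(void)          /* runs with only a few registers saved */
{
    acknowledge_timer();
    reschedule_pending = true;          /* just set a flag - cheap and fast */
}

void after_interrupt(void)              /* runs with interrupts unmasked again */
{
    if (reschedule_pending) {
        reschedule_pending = false;
        switch_to_next_task();          /* the heavy register work happens here,
                                           outside the interrupt-masked window */
    }
}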
> Other than interrupt context switches, there is also function call
> overhead.
Same concept: save/restore only what needs to be saved/restored. Fewer registers are only as good as long as you do not need more registers; once you do, you have to save/restore more *because* you have fewer registers. I explained that once already and I am doing it again for you, please do not make me do it for a third time. Just think and be willing to understand the obvious.
> I don't know what sort of calling conventions you use in your
> VPA, but for high-level languages there is normally a defined convention
> with some registers being caller-saved (or "volatile") and others being
> callee-saved (or "non-volatile"). The split can vary between compilers
> ...

This is irrelevant, we are comparing cores. Whether this or that compiler got some of its basics right or wrong has nothing to do with it. The fundamental principle - "save/restore only what you have to" - applies in all cases of programming.

IOW if you waste resources by saving/restoring more than you have to, you are doing only a little better than masking all interrupts and jumping into a "bra *" loop; the ways to destroy something working are probably infinite, these are only two of them.
>> I would certainly be interested in the example you claim to be able
>> to produce demonstrating how 16 registers can be more effective.
>> My example covers a typical, widespread application - that of a FIR.
>> Let us see yours.
>
> I can't produce a /function/ that is more efficient with 16 registers
> than 32 registers - but I hope that above I have explained how it makes
> a difference with chains of small functions (or functions with few
> register demands).
I know you cannot produce an example - the above (much of it clipped) was irrelevant.
> I think it is reasonable to say that as programs (in the embedded world)
> have got bigger, memories have got bigger, and compilers have got
> better, then the balance for many systems has moved more towards 32
> registers rather than 16 registers, in the same way that it has moved
> towards 32-bit cpus from 8 and 16-bit cpus.
You still do not get it, do you? 32 registers are not just *better*, they are a necessity on a load/store machine with a pipeline (something like 5-6 stages deep). You just cannot keep the pipeline full with only 16 registers without stalls caused by data dependencies.
> Beyond that, I would say that even if 16 registers is not /more/
> efficient than 32 registers, I have yet to see any reasoning for why 32
> registers is /significantly/ more useful for the type of code generally
> seen in microcontrollers (and yes, I have no choice here but to talk of
> generalities).
My FIR example demonstrated roughly a 3-fold improvement, and above I just explained - *again* - why. It should be easy to see how this applies not just to a FIR but to any computationally intensive algorithm where data dependencies would otherwise kick in.
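For readers who have not seen the earlier FIR example, a rough C illustration of the register-pressure argument - plain C, not the VPA code under discussion, and it assumes n is a multiple of 4:

#include <stdint.h>

/* With enough registers the loop can be unrolled so the loads are issued
   well ahead of the multiply-accumulates that consume them, hiding the
   load-to-use latency instead of stalling. */
int32_t fir_dot(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

    for (int i = 0; i < n; i += 4) {
        /* Eight independent loads - each value wants its own register so
           the multiplies below do not wait on a just-issued load. */
        int32_t x0 = x[i], x1 = x[i+1], x2 = x[i+2], x3 = x[i+3];
        int32_t h0 = h[i], h1 = h[i+1], h2 = h[i+2], h3 = h[i+3];

        /* Four independent accumulators also break the add-after-add
           dependency chain. */
        acc0 += x0 * h0;
        acc1 += x1 * h1;
        acc2 += x2 * h2;
        acc3 += x3 * h3;
    }

    return acc0 + acc1 + acc2 + acc3;
}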
> A filter algorithm might run faster with more registers
> - but it would do even better by using additional DSP-specific support
Obviously hardware other than a general purpose core can be built to do things the core cannot do. We are comparing the cores here.
> On one PPC system I worked with which did not support individual
> interrupt vectors, /all/ registers were saved (and later restored)
> because the vector code was calling unknown external code.
Poor programming. IRQ routines may never call unknown external code.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
Wouter van Ooijen <wouter@voti.nl> wrote:

(snip)
>> Simon Clubley <clubley@remove_me.eisner.decus.org-earth.ufp> wrote:
>>> On 2015-01-09, Dimiter_Popoff <dp@tgi-sci.com> wrote:
(snip)
>>> In the general case, you have to push and pop all the registers every
>>> time you take an interrupt.
(snip, then I wrote)
>> Some processors have multiple register sets that might avoid that.
>> SPARC has register windows, such that they don't have to save to
>> memory until all the windows are in use. I don't know if that is
>> for interrupts, too.
> That is a nice concept, but it has two problems:
> - when you overflow the available register sets, you must spill to
>   memory. Whether you need to do this depends on where you are in the
>   register set. This makes timing difficult to predict, which is not
>   nice for a real-time system.
Seems that they thought of that.
> - imagine a context switch: now you have to save/restore all
>   register sets!
and that.
> In general, fat-context CPUs are better at single-threaded no-interrupt
> applications, but worse for switch-often interrupt-heavy applications.
http://www.gaisler.com/doc/sparcv8.pdf

Section D.8 explains that one.

For the more usual case where function calls are much more common than context switches, and you want to minimize total overhead, more register windows are better.

For SPARC, there is a register mask that controls how many are used by user mode code. Changing that allows supervisor code to use some, such that one can optimize overall between user and supervisor code.

And finally, they consider the case where context switch time is most important. In that case, they allocate register windows between tasks, such that one can do a task switch by only changing the register mask and current window, and not storing any to memory. The OS has complete control over how the windows are used.

-- glen
On 2015-01-10, Dimiter_Popoff <dp@tgi-sci.com> wrote:
> I am still somewhat amazed at what was said.
That just shows the disconnect between what you do and what the rest of the world does. :-)
> What on earth is there to stop people from doing something
> *that* simple - here are two IRQ handlers on a small MCU, I used
> an mcf52211 a couple of months back to make a HV source - it
> does the PWM/regulating, overcurrent protection/limiting, serial
> communication etc., all in all 4 tasks, several IRQs. Took me
> about 2 weeks to program (I had hoped it would take 2 days but
> I had completely forgotten the insides of the 52211 so I had to
> recall a lot which is where the 2 weeks went). A total of about
> 250k sources, the object code being almost 9 kilobytes.
>
> So here are two IRQ handlers where hopefully it is obvious how
> only what is needed is saved and restored, pretty basic stuff:
>
> http://tgi-sci.com/misc/hvst0q.gif
>
> This is not VPA, just plain 68k (well, CF) assembly, so it is
> far from being taken from my own world.
While the actual wrapper around the device specific interrupt handler (ie: the generic IRQ code which runs before you get to the device specific handler itself) is generally still assembly language, most people don't write the actual device specific handler in assembly language any more, but use a higher level language such as C instead. Once you do that, the IRQ wrapper needs to save all the registers the C compiler could potentially use, including all the temporary registers, before the wrapper calls the device specific handler.
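A C sketch of the split being described - the assembly wrapper itself is not shown, and the table-driven layout and names are invented for illustration:

#include <stddef.h>

#define NUM_IRQ_SOURCES 32

typedef void (*irq_handler_t)(void *arg);

static irq_handler_t handler_table[NUM_IRQ_SOURCES];
static void         *handler_arg[NUM_IRQ_SOURCES];

/* Called from the assembly wrapper after the caller-saved registers have
   been stacked; 'source' is whatever the interrupt controller reports. */
void generic_irq_dispatch(unsigned source)
{
    if (source < NUM_IRQ_SOURCES && handler_table[source] != NULL)
        handler_table[source](handler_arg[source]);
}

/* Device drivers written in portable C hook themselves in at init time. */
void irq_register(unsigned source, irq_handler_t handler, void *arg)
{
    if (source < NUM_IRQ_SOURCES) {
        handler_table[source] = handler;
        handler_arg[source]   = arg;
    }
}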
> I just wonder how hopeless things must have become to question
> the viability of doing something that basic.
Different set of tradeoffs.

The higher level language code can potentially be reused on multiple architectures (or, if that's not possible in a specific case, can at least be used as the starting point for another driver); your PowerPC specific assembly language code cannot be reused in such a way.

What works for you in your restricted environment doesn't work when you need your code to work in a generic environment across a wide range of architectures.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
On 2015-01-09, Anders.Montonen@kapsi.spam.stop.fi.invalid <Anders.Montonen@kapsi.spam.stop.fi.invalid> wrote:
> microMIPS has instructions for pushing and popping the callee-save
> registers onto the stack (LWM32/LWM16/SWM32/SWM16). This is notable in a
> way because MIPS have traditionally avoided committing an ABI into the
> architecture.
I didn't know that. Thanks.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
