EmbeddedRelated.com

Integrated TFT controller in PIC MCUs

Started by pozz January 7, 2015
On 13.1.2015 г. 10:53, David Brown wrote:
> On 13/01/15 03:12, Dimiter_Popoff wrote:
>
>> So eventually - a few tons of ink later - you also accept that it is not
>> necessary to save all 32 registers of a 32 register core in an IRQ
>> handler thus there is no advantage whatsoever in having only 16
>> registers - which was the whole point of the discussion.
>> Well better late than never :-).
>
> If I can summarise the arguments here, everyone accepts that you don't
> have to save more registers than you need,

Thank God, the first two tons of ink seem to have worked eventually. You claimed exactly the opposite for a long time.

> ... and (barring unusual cases)
> you only have to save /all/ registers during a task context switch. But
> it is common to have to save all "volatile" registers, of which there
> are more in PPC and MIPS than ARM - when you have more registers in the
> cpu, you /will/ do more unnecessary register saves and restores.
This is where the next few tons of ink will have to go, apparently. What on Earth makes you think having 32 registers rather than 15 makes you have more volatile registers?

Starting to spend the third ton of ink: you only have to save the registers which you use. There is nothing stopping you from using only 3-4 registers in an interrupt handler, thus saving only 3-4 registers on either machine. If the third ton of ink does not make that clear for you, please recycle back to the first 2 tons, let us be environmentally friendly.

>> Why 32 registers are a must on a load/store machine with a reasonably
>> deep pipeline I already explained; thus my point that ARM with its
>> 15 GPR-s is a crippled load/store architecture stays valid.
>
> This is, I think the more interesting point, which I do not believe has
> been covered properly. It is clear for any given function, having more
> registers will not give slower code than having fewer registers, all other
> things being equal. But will more registers give /significantly/ faster
> code? If so, under what circumstances is that the case?
I already explained that - when you have data dependencies. The FIR implementation is a classic example of that. Everything else being equal if you have only 15 registers the 6-stage pipeline will stall about 2/3 of the time, check the former ton of ink we spilled.
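The FIR code referred to above is not part of this section of the thread, so the following is only a minimal sketch of the data-dependency argument, with hypothetical function names. The first version accumulates into a single variable, so every multiply-accumulate waits on the previous one; the second keeps four independent sums, which gives a pipelined core independent work to overlap but also keeps more values live at once. Whether a given compiler actually maps these variables to registers, and how much a particular core stalls, is an assumption that would need checking on real hardware.

```c
#include <stddef.h>

/* One accumulator: each multiply-accumulate depends on the previous
 * value of acc, so the additions form a single serial chain. */
long fir_serial(const int *x, const int *h, size_t taps)
{
    long acc = 0;
    for (size_t i = 0; i < taps; i++)
        acc += (long)x[i] * h[i];
    return acc;
}

/* Four independent accumulators: the four chains do not depend on each
 * other, so one can proceed while another waits for a load or multiply.
 * The price is more values that must be held at the same time. */
long fir_unrolled4(const int *x, const int *h, size_t taps)
{
    long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= taps; i += 4) {
        a0 += (long)x[i]     * h[i];
        a1 += (long)x[i + 1] * h[i + 1];
        a2 += (long)x[i + 2] * h[i + 2];
        a3 += (long)x[i + 3] * h[i + 3];
    }
    for (; i < taps; i++)          /* leftover taps when taps % 4 != 0 */
        a0 += (long)x[i] * h[i];
    return a0 + a1 + a2 + a3;
}
```

Both functions return the same integer result; the difference is only in how many partial sums, samples and coefficients are in flight at once, which is where the number of named registers starts to matter.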
>... Deep pipeline superscalar processors invariably
> have register renaming, which obsoletes the need for many visible
> registers.

They do have that and it does not obsolete the need in question. It saves you from unnecessary serializations, yes, but it does not help against data dependencies - which is what makes 15 registers too few for a load/store machine (unless it is non-pipelined, which I am sure ARM was at least initially, but that is even more crippling).

> Thus I think you are a very long way from being justified in claiming
> that the 16 registers in ARM make the architecture "crippled". There is
> certainly code for which 32 registers works better than 16 even when you
> have renaming, especially on larger processors, because you want to
> refer to more data at a time without having to reference the stack or
> other memory data. But that doesn't make the 16-register ARM "crippled".
So eventually you do understand that having 32 registers makes the (load/store) machine more efficient by definition. My FIR example demonstrated this can be up to a few *times* more efficient. And yet you call an architecture which is crippled by design - being unable to keep up with the one it is compared to simply because it has been designed as it is - non-crippled.

Well, your choice of words does not alter the reality - which is that you just cannot design in 15 registers the equivalent of a load/store machine with 32 registers. You can build hardware around that, Intel have been doing that for ages to keep their even more crippled x86 model alive, but you can build hardware to do about anything (we covered that, too, so hopefully we will not go there again).

Clearly ARM was initially designed saving on design resources - time, designer skill - to have something working to sell. Performance-wise its architecture is dramatically inferior to Power exactly because they made it with only 16 registers, perhaps targeting it at small, low power applications. It has obviously been superior to Power for the smallest of applications (like in the first phones) but when it comes to performance it is what it is.

Notice "crippled" does not mean unusable; it only means that under equal conditions (for large enough systems, we covered that already, say 1M+ RAM) ARM will be at a significant disadvantage compared to Power, up to a few times slower. Of course certain tasks can be done by the crippled CPU no slower, it is just that the opposite is never the case.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
Dimiter_Popoff <dp@tgi-sci.com> wrote:
> On 13.1.2015 г. 09:33, Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:
>> The CPU reserves space in the stack frame for the caller-saved FPU >> registers ^^^^^^^^^^^^^^^^^^^^^^^^ >> ^^^^^^^^^ > This for all FPU registers?
No. The point of the Cortex-M automatic interrupt prologue is to allow ISRs to be normal C functions without any assembly glue.
> (probably you can switch it off?)
It is optional. -a
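As a rough illustration of the answer above, here is a sketch of what the hardware stacks on exception entry for an ARMv7-M core, written as C structs so the sizes can be checked. The struct and variable names are hypothetical; the handler name `SysTick_Handler` follows the usual CMSIS convention but really depends on the vendor startup file. The layout assumes the floating-point extension is present and FP context stacking is enabled; an extra alignment word may also be inserted by the hardware, which is not modelled here.

```c
#include <stdint.h>

/* Basic frame stacked by hardware on exception entry: only the
 * caller-saved integer registers plus return state (8 words). */
typedef struct {
    uint32_t r0, r1, r2, r3;
    uint32_t r12;
    uint32_t lr;
    uint32_t pc;
    uint32_t xpsr;
} basic_frame_t;

/* Extended frame when FP context stacking is active: space for the
 * caller-saved s0-s15 and FPSCR only, not s16-s31 (26 words). */
typedef struct {
    basic_frame_t core;
    uint32_t s[16];
    uint32_t fpscr;
    uint32_t reserved;
} extended_frame_t;

_Static_assert(sizeof(basic_frame_t) == 8 * 4, "basic frame is 8 words");
_Static_assert(sizeof(extended_frame_t) == 26 * 4, "extended frame is 26 words");

/* Because the hardware has already saved the caller-saved registers,
 * the handler can be an ordinary C function with no assembly wrapper. */
volatile uint32_t tick_count;

void SysTick_Handler(void)
{
    tick_count++;
}
```

The callee-saved registers (r4-r11, s16-s31) are left to the compiler, which saves them in the handler's own prologue only if the handler actually uses them.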
In article <m92md7$j6q$1@dont-email.me>, david.brown@hesbynett.no says...
.....
> For interrupts, function calls and context switches, it seems clear that
> 32 registers involves more saves and restores than 16 registers, but
> there is no convincing argument that this difference is relevant. This
> round is a tie in the great 16-reg vs. 32-reg battle, and we should move
> on to more interesting points.

The whole 32 v 16 register 'debate' has been a "how many fairies fit on a pinhead" type of discussion, all based on the types of applications individual posters normally write.

We have no idea what type of application or even range of applications the processor is for, let alone what type of processing is required.

Personally, I think observations on the following might have been more useful:

1/ Package type options for the processor, and other compiler support merits

2/ Merits of TFT controller flexibility

3/ What type of things he will do with the TFT, and whether the UI has to have animated or moving widgets, or even phone-style windowing and wipe effects

4/ Does the TFT controller have its own frame buffer(s), and what are their limits

5/ Does it have hardware assist, or does it rely on memory-to-memory DMA for copying screens or bits of screens

6/ Graphical library support and limitations (there was a bit early on)

If the application is going to be busy doing lots of memory moves while the TFT controller is accessing shared memory, that is going to put a bigger load on the application than most other things, in the MAJORITY of applications.

--
Paul Carpenter | paul@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/> PC Services
<http://www.pcserviceselectronics.co.uk/pi/> Raspberry Pi Add-ons
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.badweb.org.uk/> For those web sites you hate
David Brown wrote on 13-Jan-15 at 9:53 AM:
> On 13/01/15 03:12, Dimiter_Popoff wrote: > >> So eventually - a few tons of ink later - you also accept that it is not >> necessary to save all 32 registers of a 32 register core in an IRQ >> handler thus there is no advantage whatsoever in having only 16 >> registers - which was the whole point of the discussion. >> Well better late than never :-). > > If I can summarise the arguments here, everyone accepts that you don't > have to save more registers than you need, and (baring unusual cases) > you only have to save /all/ registers during a task context switch. But > it is common to have to save all "volatile" registers, of which there > are more in PPC and MIPS than ARM - when you have more registers in the > cpu, you /will/ do more unnecessary register saves and restores. > Opinions differ wildly on the significance or importance of this. > > For interrupts, function calls and context switches, it seems clear that > 32 registers involves more saves and restores than 16 registers, but > there is no convincing argument that this difference is relevant. This > round is a tie in the great 16-reg vs. 32-reg battle, and we should move > on to more interesting points. > > >> Why 32 registers are a must on a load/store machine with a reasonably >> deep pipeline I already explained; thus my point that ARM with its >> 15 GPR-s is a crippled load/store architecture stays valid. > > This is, I think the more interesting point, which I do not believe has > been covered properly. It is clear for any given function, having more > registers is not give slower code than having fewer registers, all other > things being equal.
I disagree. Registers are not free: they cost die space, power, and, probably most important, bits in the opcode. Other things being equal (and the instruction bandwidth being a limit), more registers means fewer bits for other things, with the potential for slower code.
> But will more registers give /significantly/ faster
> code? If so, under what circumstances is that the case? And how does > it compare to using the same hardware space and/or opcode instruction > space for other features? > > When you have a deep pipeline and superscaler execution (which is not > the case for most microcontroller cpus), you have to have a lot of data > passing through the core to make full use of it, and lots of data "in > flight" at a time. And since data has to pass through registers, that > means lots of registers. But does that mean needing lots of /visible/ > registers in the ISA? Deep pipeline superscaler processors invariably > have register renaming, which obsoletes the need for many visible > registers. > > Without register renaming, you need to "manually" (i.e., either the > assembly programmer or the compiler, rather than the cpu itself) assign > registers in order to schedule and interleave reading new data in, doing > calculations, and writing out the results to maximise the throughput - > your aim is to avoid the key execution units having to wait for incoming > data. But with register renaming, you can use the same register names > all the way - the cpu handles the renaming and scheduling. The result > is that the code is smaller, simpler, clearer, and more efficient for > caching (especially if the cpu has a super-fast cache for small loops). > > > So if you have a PPC core such as the e200z7, with a 10-stage pipeline > and dual issue execution unit, but no register renaming, you need more > than 16 named registers to keep the execution units busy in hard > calculations. But on a small ARM (Cortex-M3/M4) with a single-issue cpu > and a three stage pipeline, 16 registers is sufficient. And on a large > ARM (Cortex-A) with a multiple issue, deep pipeline core, 16 /named/ > registers is /still/ sufficient because there are a large number of > /unnamed/ registers for remapping. 
> > > Thus I think you are a very long way from being justified in claiming > that the 16 registers in ARM make the architecture "crippled". There is > certainly code for which 32 registers works better than 16 even when you > have renaming, especially on larger processors, because you want to > refer to more data at a time without having to reference the stack or > other memory data. But that doesn't make the 16-register ARM "crippled". > >> >> I never wanted to go into deeper detail on what this or that >> particular core does right or wrong, the whole point was the basic >> 32 vs. 16 (15 on ARM really) GPR-s. >> >> No, there is not manual available for VPA at the moment as there are >> no machines on the market runnning DPS other than our spectrometry >> devices. Once I decide to make DPS, VPA and the whole thing separately >> marketable to compete with MS, linux and the like I will announce it >> loudly enough I suppose. >> >> Dimiter >> >> ------------------------------------------------------ >> Dimiter Popoff, TGI http://www.tgi-sci.com >> ------------------------------------------------------ >> http://www.flickr.com/photos/didi_tgi/ >> >
On 2015-01-13, Dimiter_Popoff <dp@tgi-sci.com> wrote:
> On 13.1.2015 г. 10:53, David Brown wrote:
>> ... and (barring unusual cases)
>> you only have to save /all/ registers during a task context switch. But
>> it is common to have to save all "volatile" registers, of which there
>> are more in PPC and MIPS than ARM - when you have more registers in the
>> cpu, you /will/ do more unnecessary register saves and restores.
>
> This is where the next few tons of ink will have to go apparently.
> What on Earth makes you think having 32 registers rather than 15
> makes you have more volatile registers.
Because it depends on the ABI in use.
> Starting to spend the third ton of ink: you only have to save the
> registers which you use. There is nothing stopping you from using
> only 3-4 registers in an interrupt handler thus saving only 3-4
> registers on either machine. If the third ton of ink does not
> make that clear for you please recycle back to the first 2 tons, let
> us be environmentally friendly.
>

If you use an ABI in which most of the 32 registers are callee saved, or write your device specific handler in assembly language and hence have direct control over the registers in use, then you are correct.

If you use a higher level language to write your handler and the ABI in use states around half of those registers are caller saved, then, in the general case, your IRQ wrapper must save those registers before it calls that handler, because the compiler will generate code which conforms to that ABI.

These days, most people write their drivers in a higher level language such as C, and code from different people/teams has to work together, so the compiler must conform to the ABI in use.

This means that, in the general case, if your ABI requires the caller to save (say) ~16 registers out of the 32 registers but the code generated by the compiler for a specific driver only uses 6 of the caller saved registers, then those ~16 registers still need to be saved because the wrapper doesn't know any different.

The upside is that you get a general purpose ABI in which everyone's higher level language code can work together.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
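The counting argument in the post above can be modelled in a few lines of C. This is only an illustrative sketch: the register masks and function names are hypothetical and do not describe any real ABI; one bit stands for one register.

```c
#include <stdint.h>

/* Count the set bits in a 32-bit register mask (one bit per register). */
static unsigned count_regs(uint32_t mask)
{
    unsigned n = 0;
    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* A generic IRQ wrapper that calls an ABI-conforming C handler cannot
 * see which registers the handler touches, so it has to save every
 * register the ABI marks as caller-saved. */
unsigned saves_generic_wrapper(uint32_t abi_caller_saved)
{
    return count_regs(abi_caller_saved);
}

/* A hand-written assembly handler controls its own register use and
 * only needs to save the registers it actually modifies. */
unsigned saves_tailored_handler(uint32_t regs_used)
{
    return count_regs(regs_used);
}
```

With a made-up ABI that marks 16 registers as caller-saved (mask `0x0000FFFF`) and a handler that touches 6 of them (mask `0x0000003F`), the generic wrapper saves 16 registers while the tailored handler saves 6, which is the gap the post describes.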
On 13-Jan-15 15:31, Wouter van Ooijen wrote:
> David Brown schreef op 13-Jan-15 om 9:53 AM: >> On 13/01/15 03:12, Dimiter_Popoff wrote: >> >>> So eventually - a few tons of ink later - you also accept that it is not >>> necessary to save all 32 registers of a 32 register core in an IRQ >>> handler thus there is no advantage whatsoever in having only 16 >>> registers - which was the whole point of the discussion. >>> Well better late than never :-). >> >> If I can summarise the arguments here, everyone accepts that you don't >> have to save more registers than you need, and (baring unusual cases) >> you only have to save /all/ registers during a task context switch. But >> it is common to have to save all "volatile" registers, of which there >> are more in PPC and MIPS than ARM - when you have more registers in the >> cpu, you /will/ do more unnecessary register saves and restores. >> Opinions differ wildly on the significance or importance of this. >> >> For interrupts, function calls and context switches, it seems clear that >> 32 registers involves more saves and restores than 16 registers, but >> there is no convincing argument that this difference is relevant. This >> round is a tie in the great 16-reg vs. 32-reg battle, and we should move >> on to more interesting points. >> >>> Why 32 registers are a must on a load/store machine with a reasonably >>> deep pipeline I already explained; thus my point that ARM with its >>> 15 GPR-s is a crippled load/store architecture stays valid. >> >> This is, I think the more interesting point, which I do not believe has >> been covered properly. It is clear for any given function, having more >> registers is not give slower code than having fewer registers, all other >> things being equal. > > I disagree. Registers are not free: the cost die space, power, and > probably most important: bits in the opcode. Other things being equal > (and the instruction bandwith being a limit) more register means less > bits for other things, with the potential for slower code.
Like almost everything in engineering it is a trade-off. The number of registers needed to accomplish a task efficiently also depends on other aspects of the ISA. For example, with an ISA with more sophisticated addressing modes one may need fewer registers than with a minimalistic RISC ISA. Many modern (superscalar) processors have more registers internally than are exposed via the ISA; register renaming reduces the chance that registers become a performance bottleneck.

With the x86 64-bit instruction set its designers chose to expand the number of general purpose registers from 8 to 16. They could easily have chosen a larger number of registers, but apparently their analysis showed that the benefit of more registers did not outweigh the downsides. I'd say that it is a bit too simplistic to state that an ISA that has only 15 GP registers must be crippled.

I think this discussion about the optimum number of registers would be more appropriate in comp.arch, where the people are who are/were involved with processor design.
On 13.1.2015 г. 22:00, Simon Clubley wrote:
> On 2015-01-13, Dimiter_Popoff <dp@tgi-sci.com> wrote:
>> On 13.1.2015 г. 10:53, David Brown wrote:
>>> ... and (barring unusual cases)
>>> you only have to save /all/ registers during a task context switch. But
>>> it is common to have to save all "volatile" registers, of which there
>>> are more in PPC and MIPS than ARM - when you have more registers in the
>>> cpu, you /will/ do more unnecessary register saves and restores.
>>
>> This is where the next few tons of ink will have to go apparently.
>> What on Earth makes you think having 32 registers rather than 15
>> makes you have more volatile registers.
>
> Because it depends on the ABI in use.
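The "bits in the opcode" cost mentioned in the quoted text is simple arithmetic, sketched here with hypothetical helper names. Naming one of N registers takes ceil(log2(N)) bits, and a three-operand instruction pays that three times.

```c
/* Bits needed to name one of num_regs registers: ceil(log2(num_regs)).
 * Intended for small register counts; the loop stops at 32 bits. */
unsigned reg_field_bits(unsigned num_regs)
{
    unsigned bits = 0;
    while (bits < 32 && (1u << bits) < num_regs)
        bits++;
    return bits;
}

/* Opcode bits spent on register fields for an instruction that names
 * the given number of register operands. */
unsigned operand_bits(unsigned num_regs, unsigned operands)
{
    return reg_field_bits(num_regs) * operands;
}
```

For a three-operand instruction this gives 9 bits with 8 registers, 12 bits with 16 and 15 bits with 32. In a 32-bit instruction word that leaves 20 versus 17 bits for everything else; in a 16-bit encoding the difference is far more painful, which is one reason the trade-off depends so much on the rest of the ISA.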
This is at least the third time I have explained this to you but I don't mind, I'll do it as many times as it takes: there are many ways to destroy something working other than inept programming, some of them much easier.

So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet, some serial interfaces? Try to give some figure - please notice the word "guaranteed", I know how much the linux crowd prefers to talk "in general".

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/
On 13/01/15 21:57, Dombo wrote:
> Op 13-Jan-15 15:31, Wouter van Ooijen schreef: >> David Brown schreef op 13-Jan-15 om 9:53 AM: >>> On 13/01/15 03:12, Dimiter_Popoff wrote: >>> >>>> So eventually - a few tons of ink later - you also accept that it is >>>> not >>>> necessary to save all 32 registers of a 32 register core in an IRQ >>>> handler thus there is no advantage whatsoever in having only 16 >>>> registers - which was the whole point of the discussion. >>>> Well better late than never :-). >>> >>> If I can summarise the arguments here, everyone accepts that you don't >>> have to save more registers than you need, and (baring unusual cases) >>> you only have to save /all/ registers during a task context switch. But >>> it is common to have to save all "volatile" registers, of which there >>> are more in PPC and MIPS than ARM - when you have more registers in the >>> cpu, you /will/ do more unnecessary register saves and restores. >>> Opinions differ wildly on the significance or importance of this. >>> >>> For interrupts, function calls and context switches, it seems clear that >>> 32 registers involves more saves and restores than 16 registers, but >>> there is no convincing argument that this difference is relevant. This >>> round is a tie in the great 16-reg vs. 32-reg battle, and we should move >>> on to more interesting points. >>> >>>> Why 32 registers are a must on a load/store machine with a reasonably >>>> deep pipeline I already explained; thus my point that ARM with its >>>> 15 GPR-s is a crippled load/store architecture stays valid. >>> >>> This is, I think the more interesting point, which I do not believe has >>> been covered properly. It is clear for any given function, having more >>> registers is not give slower code than having fewer registers, all other >>> things being equal. >> >> I disagree. Registers are not free: the cost die space, power, and >> probably most important: bits in the opcode. 
Other things being equal >> (and the instruction bandwith being a limit) more register means less >> bits for other things, with the potential for slower code. > > Like almost everything in engineering it is a trade off. The number of > register needed to accomplish a task efficiently also depends on other > aspects of the ISA. For example with an ISA with more sophisticated > addressing modes one may need less registers than with a minimalistic > RISC ISA. Many modern (superscalar) processors have internally more > registers than exposed via the ISA, register renaming technique reduces > the chance that registers become a performance bottleneck. With the x86 > 64-bit instruction set its designers choose to expand the number of > general purpose registers from 8 to 16. They could have easily chosen a > larger number of registers but apparently their analysis showed that the > benefit of more registers did not outweigh the downsides. I'd say that > it is a bit too simplistic to state that a ISA that has only 15 GP > registers must be crippled.
It's useful to make the distinction between /named/ registers (exposed in the ISA to the programmer) and /unnamed/ registers (implementation dependent, internal registers for register renaming).

When designing the amd64 ISA, the AMD folks, working tightly with gcc developers, Linux kernel developers, and presumably many other people, concluded that 16 named GP registers was the right balance for the architecture. It was long established that the 8 registers of x86 was too few, but as you say their analysis did not show much benefit of more than 16 registers - and the disadvantages (opcode space, and extra register stores in function calls) outweighed any advantage. Internally, implementations of amd64 might have hundreds of unnamed GP registers.

Also note that the amd64 architecture has lots of SIMD registers as well as GP registers. I think in most examples where large numbers of GP registers would help, SIMD registers are a better solution - and are therefore implemented on most fast cpu designs.

Finally, the discussion was centred on load-store architectures such as ARM, MIPS and PPC. x86/amd64 are not load-store, and can do more with fewer named registers. Dimiter's assertion was that a load-store architecture is inherently crippled if it has only 16 registers - he has not commented on CISC architectures.

A more relevant example is the 64-bit ARM architecture - which has 32 GP registers. That does not in any way prove that the old 32-bit ARM was "crippled" with only 16 registers - but it does show that for such a large processor, the extra registers give a positive trade-off.

>
> I think this discussion about the optimum number of registers would be
> more appropriate in comp.arch, where the people are who are/were involved
> with processor design.
>
On 14/01/15 02:11, Dimiter_Popoff wrote:
> So what is the guaranteed IRQ latency on your ARM core of choice
> running linux with some SATA drives, multiple windows, ethernet,
> some serial interfaces. Try to give some figure - please notice
> the word "guaranteed", I know how much the linux crowd prefers
> to talk "in general".
Having L1/L2/L3 caches will instantly introduce a high variation between the mean and max latencies. Even for i486s with their minimal cache and no operating system, a 10:1 variability was visible. Any variability to do with register saving will be completely insignificant compared to the effects of caches. Unless, of course, you are having to dump the entire hidden state of an Itanic processor :)
On 14.1.2015 г. 13:42, Tom Gardner wrote:
> On 14/01/15 02:11, Dimiter_Popoff wrote:
>> So what is the guaranteed IRQ latency on your ARM core of choice
>> running linux with some SATA drives, multiple windows, ethernet,
>> some serial interfaces. Try to give some figure - please notice
>> the word "guaranteed", I know how much the linux crowd prefers
>> to talk "in general".
>
> Having L1/L2/L3 caches will instantly introduce a high variation
> between the mean and max latencies. Even for i486s with their
> minimal cache and no operating system, a 10:1 variability was
> visible.

Yes, though on some processors one has the ability to lock part of the L1 cache - which allows it to be dedicated to interrupts, which can make things a lot tighter (by saving the necessity to update entire cachelines). Overall the latency variability obviously increases as processor sizes increase, but then total execution times decrease, memories get faster etc., so the worst case latency can still be very low. On the 5200b which I use I have never needed to resort to any cache locks etc.; all I do is stay masked only as long as absolutely necessary.

> Any variability to do with register saving will be completely
> insignificant compared to the effects of caches. Unless, of
> course, you are having to dump the entire hidden state of
> an Itanic processor :)
>

Well, we have not come to that obvious point yet I am afraid :-). Let us first have the figure on the worst-case linux IRQ latency I asked for, then put into its context the attempt of ARM/linux devotees to get lower latency by not having enough registers :-).

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/