EmbeddedRelated.com

What application requires 500MHz for embedded processors

Started by jade March 5, 2006
Wilco Dijkstra wrote:
> "larwe" <zwsdotcom@gmail.com> wrote in message
> news:1141680159.304392.46030@j33g2000cwa.googlegroups.com...
> >
> > Wilco Dijkstra wrote:
> >
> >> Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12%
> >> faster than a C54xx running at the same frequency.
> >
> > What are the MIPS/mW figures like?
>
> According to TI numbers the lowest power C54xx uses around 0.6mW/Mhz
> (160Mhz max, not sure what process). It's measured as 50% NOPs, 50%
> MACs (I'm not convinced that is typical). ARM11 uses 0.8mW/Mhz
> (500Mhz at 130nm, with IEM it becomes 0.5mW/Mhz). So the C54 has
> ~16% advantage in energy per task.
>
> So general purpose processors are competitive but not quite there yet.
> A superscalar CPU with a lower maximum frequency will be more power
> efficient than an ARM11 while achieving similar performance. It will be
> interesting to see how good Cortex-R4 turns out...
>
> Wilco
While it is true that general purpose processors deliver more and more
performance, comparing two particular architectures takes a lot more
knowledge than can be found in the parametric lists. This kind of general
talk, using numbers with no context, is quite common - it has been for
decades - but it can be misleading to beginners, so be warned :-).

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
Wilco,
here we go into a detailed comparison... It may be tougher to do than
a few exchanged postings allow, but having warned the beginners to read
our stuff critically and think on their own, let's give it a try.


> An ARM11 can't execute 3 independent loads/stores per cycle,
> every cycle. But it can read or write 64-bits per cycle. In 4 cycles
> it can read 4x4 16-bit values from 4 independent addresses.
This is a key issue. You do not need 64-bit accesses; you do need to
access a table with coefficients - probably at a static address, but
possibly lengthy (kilowords) - an area with the incoming data, and an
area with the filtered results... this makes 3 *independent* accesses
per cycle, plus the program fetch, plus one DMA moving data over one of
these busses (some of the memory allows two accesses per cycle, which
is very handy), plus another DMA doing other work.

The way you describe the ARM in question, I would say it might be able
to manage the example - given that the external memory interface is
fast enough (DDR2 should do, I suppose). But this has to be proven -
you know how it is, the devil is in the details. Then again, I did the
5420 design 5 years ago and might have made a different choice today;
as I said in another posting, it takes more knowledge of that ARM than
I now have to be able to tell.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------

Wilco Dijkstra wrote:
> "Didi" <dp@tgi-sci.com> wrote in message
> news:1141689973.771281.79780@i39g2000cwa.googlegroups.com...
> > Wilco,
> >
> >> That is 1 MAC and 3 16-bit reads/writes per cycle. The ARM11 can do
> >> 2 MACs and read/write 4 16-bit values per cycle including address
> >> increment. Which is faster?
> >
> > I am not intimately familiar with ARM so you might be right after all.
> > However, I need to ask: how many address and data busses has
> > the ARM in question so it can read/write 3 independent address
> > areas simultaneously? What memory interface does it use so it
> > can read/write to memory in a single cycle (to all 3 busses)?
>
> Internally the ARM11 has a L2 sub system with 4 independent
> 64-bit buses which can run at full core speed and a 32-bit
> peripheral bus. The L1 system is Harvard with separate 64-bit
> buses connecting to the core and L2 system. All of these can
> work in parallel. However that wasn't what I was talking about.
>
> An ARM11 can't execute 3 independent loads/stores per cycle,
> every cycle. But it can read or write 64-bits per cycle. In 4 cycles
> it can read 4x4 16-bit values from 4 independent addresses.
> So on streaming data it effectively achieves the same
> throughput as 4 independent 16-bit memory accesses without
> using XYZ memories. ARM11 can also issue a load multiple
> instruction and continue to do computations while the memory
> system fetches the data in the background - even on a cache miss.
>
> > But to make the comparison fairer, let us compare a practical
> > case - something I have done on a 5420. The sampled signal is 14 bits,
> > there are 9.2 MSPS continuously running and DMA buffered without
> > missing a single sample on a circular queue in the on-chip memory.
> > Parallel with the sampling, the DSP does some filtering to recognize
> > events using about 90% of its theoretical MAC bandwidth;
> > the remaining bandwidth goes on qualifying found events and
> > programming yet another of the many DMACs to pass enough of
> > the samples surrounding the event to the other DSP core.
> > This takes exactly a 100 MHz clocked 54xx core, all memory
> > being internal, in fact you can safely claim there is no external
> > hardware of interest related to the comparison.
> >
> > Can you do that with a 500 MHz ARM? I know you cannot do it
> > with a 500 MHz PPC which is what I am familiar with (memory
> > latencies will kill you). Remember, all this is
> > done in real time, no missed samples, no missed events.
> > When you add the second core's consumption (also almost 100%
> > busy, but this one so only under the toughest conditions),
> > you get about 300 mW...
>
> Yes, from what you say it could do it in 100Mhz. If the ADC has a
> FIFO you could soft-DMA the samples directly into local memory
> or cache, something like 256 bytes would give an interrupt rate of
> 100K, taking about 10MHz. Using the built-in DMA allows a smaller
> buffer and has virtually no overhead. If all else fails, the samples
> could be written to main memory in 32-byte blocks, as the required
> bandwidth of 20MBytes/s is tiny.
>
> Doing 100M MACs would take less than 100MHz if all data is in
> the L1 cache (which can be locked) or local memory. Using SDRAM
> or L2 costs a bit more, but the data can be streamed into local
> memory using the DMA or using software prefetch. If everything
> fits in the caches/local memory doing it in 100Mhz is achievable.
>
> I can imagine a PPC without Altivec has a problem doing MACs.
> Also it may not have good enough cache lockdown/local memory
> facilities so worst case latencies may just be too high. I don't know
> which PPC you meant, but ARM11 was designed for precisely this
> kind of realtime on-chip processing.
>
> Wilco
"Jim Granville" <no.spam@designtools.co.nz> wrote in message 
news:440e183c$1@clear.net.nz...
> Wilco Dijkstra wrote:
>> "larwe" <zwsdotcom@gmail.com> wrote in message
>> news:1141680159.304392.46030@j33g2000cwa.googlegroups.com...
>>
>>> Wilco Dijkstra wrote:
>>>
>>>> Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12%
>>>> faster than a C54xx running at the same frequency.
>>>
>>> What are the MIPS/mW figures like?
>>
>> According to TI numbers the lowest power C54xx uses around 0.6mW/Mhz
>> (160Mhz max, not sure what process). It's measured as 50% NOPs, 50%
>> MACs (I'm not convinced that is typical). ARM11 uses 0.8mW/Mhz
>> (500Mhz at 130nm, with IEM it becomes 0.5mW/Mhz). So the C54 has
>> ~16% advantage in energy per task.
>>
>> So general purpose processors are competitive but not quite there yet.
>> A superscalar CPU with a lower maximum frequency will be more power
>> efficient than an ARM11 while achieving similar performance. It will be
>> interesting to see how good Cortex-R4 turns out...
>
> Another thing to watch is whether the values for the C54xx include
> memory (probably, as it is on chip?) and the ARM11 ones exclude memory
> (probably not, as that is off-chip?) - so the cores themselves
> might be in the same ballpark, but what about the system figures?
Both include on-chip memory/caches of course. Nobody includes the power
consumption of external memory, as it depends too much on the particular
core and memory configuration. If you somehow need to access external
memory a lot on a cached core, there is something seriously wrong...

There is no independent company doing power consumption benchmarks;
something like that is needed, as measuring energy accurately and
fairly is difficult...

Wilco
Hello Wilco,

The discussion up to here seems beyond my knowledge, but I'd like to
know the details.
Could you please give some guidance on where I can learn it?

I have basic pipeline and cache knowledge, but not how the two combine
and how they relate to performance.

Many thanks.

"Didi" <dp@tgi-sci.com> wrote in message 
news:1141780964.782022.72790@p10g2000cwp.googlegroups.com...
> Wilco,
> here we go in a detailed comparison... It may be tougher to do than
> a few postings exchanged, but having warned the beginners to beware
> reading our stuff and think on their own, let's give it a try.
Sure, comparisons between different architectures are difficult,
especially when they are built on different design principles (DSP vs
RISC, RISC vs CISC, etc.). But that has never stopped me before!
>> An ARM11 can't execute 3 independent loads/stores per cycle,
>> every cycle. But it can read or write 64-bits per cycle. In 4 cycles
>> it can read 4x4 16-bit values from 4 independent addresses.
>
> This is a key issue. You do not need 64 bits accesses, you
> do need to access a table with coefficients - probably
> at a static address, but can be lengthy (kilowords), an
> area with the data which and an area with the filtered
> results... this makes 3 *independent* accesses per cycle,
So these are two different ways of achieving the same bandwidth: N
narrow memory accesses in parallel and a single N-way wide access are
equivalent. DSPs often go for VLIW while general purpose CPUs use SIMD.
Each has advantages and disadvantages. High-end DSPs use both.
> The way you describe the ARM in question, I would say
> it might be able manage the example - given the external
> memory interface is fast enough (DDR2 should do, I suppose).
There is no need for DDR2; in your particular example there was no need
for external storage at all, as all data movements are done on-chip.
With several buses running in parallel at several GBytes/s each, that's
fast enough. :-)

You could get 1GByte/s of bandwidth using 32-bit DDR SDRAM, which is
still way more than the example needs (around 40MBytes/s if the samples
are stored in DRAM).
> But this has to be proven - you know how it is, the devil is
> in the detail. But then again, I have done the 5420 design
> 5 years ago, I might have made a different choice today,
> like I said in another posting, it takes more knowledge
> than I now have for that ARM to be able to tell.
5 years ago there was no ARM11, and while an ARM10 might be able to do
it, it doesn't have many of the ARM11 features, so it would need to run
at a much higher frequency, probably 200..250MHz.

Wilco
"jade" <jade.emily@msa.hinet.net> wrote in message 
news:1141869989.839053.82920@v46g2000cwv.googlegroups.com...
> Hello Wilco,
>
> The discussion till here seems over my knowledge but I'd like to know
> the detail.
> Could you please give some guildances where I can get it?
>
> I've pipeline and basic cache knowledge but not the mix-up of both
> and the relationship to performance.
If you're interested in (micro)architecture then you need Hennessy and
Patterson's "Computer Architecture: A Quantitative Approach". Reading
comp.arch and articles on sites like realworldtech.com may be useful
(these are not embedded though).

There are various books about the ARM architecture, e.g. "ARM System
Architecture" by Steve Furber, and the "ARM System Developer's Guide".
The first is an introduction to the ARM architecture and its various
implementations, while the second is more software and optimization
oriented, with lots of highly optimized code examples. There is a DSP
chapter of course, with details on how to write FIR filters and such.

Wilco
> So these are 2 different ways of achieving the same bandwidth:
> N narrow memory accesses in parallel or a single N-way wide
> access are equivalent.
No no, the addresses of the 3 busses just cannot be tied together; you
have to make 3 data accesses per cycle, plus obviously an instruction
fetch. The processor has to be able to do 4 simultaneous memory accesses
every 10 ns, each being 16 bits wide and having its own address - and
allow the DMAC to keep on buffering in yet another address area at the
same time.

I do not know the ARM11, but I considered an MPC5200 - a 400 MHz PPC
with DDR - and I estimated it would be far from sufficient. Can you say
the ARM11 at 500 MHz has 3 times the power of the 400 MHz PPC (603e
core, 32-bit DDR 266)? That is what I thought would be about enough to
make further evaluation worthwhile.
> DSPs often go for VLIW while general
> purpose CPUs use SIMD. Each of these has advantages and
> disadvantages. High end DSPs use both.
There is no VLIW in the 54xx series DSPs, and the issue is not opcode
cycles; the PPC can do a floating point MAC in a single cycle, the ARM
can probably do the same, etc. The issue is memory bandwidth - not just
sequential burst bandwidth, but random accesses to multiple areas. This
is the major difference between a DSP and a general purpose processor.
> There is no need for DDR2, in your particular example there
> was no need for external storage as all data movements are
> done on-chip. With several buses running in parallel at several
> GBytes/s each that's fast enough. :-)
Well, the 5420 has about 200 kilowords - 400 kilobytes - of memory, and
my application uses it all. I don't know if the ARM has that much, or
whether the program code will be short enough to fit in - I know the
PPC does not have it, it will need external memory.
> 5 years ago there was no ARM11, and while ARM10 might
> be able to do it, it doesn't have many of the ARM11 features,
> so it would need to run at a much higher frequency, probably
> 200..250Mhz.
The more you bring me back into it, the more details come back, and I
would say you might be able to convince me that the 500 MHz ARM can do
the job of a 100 MHz 54xx core - provided it has all the memory, or you
add the external memory it takes. But there are two 100 MHz cores in
the 5420... I tend to think "no chance", and since I am too busy to do
all the work it takes to estimate whether I could fit it in such an ARM
chip, I guess we'll have to wait until I have to design another device
of the kind and pick the most suitable part at that moment; then I'll
know.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
"Didi" <dp@tgi-sci.com> wrote in message 
news:1142035955.284418.33310@e56g2000cwe.googlegroups.com...
>> So these are 2 different ways of achieving the same bandwidth:
>> N narrow memory accesses in parallel or a single N-way wide
>> access are equivalent.
>
> No no, the addresses of all the 3 busses just cannot be tied
> together, you have to make 3 data accesses per cyle, plus
> obviously an instruction fetch.
There is no need to do 3 independent accesses per cycle. This is a very
inefficient way of increasing bandwidth, which is why modern CPUs
increase the width of their buses instead. 64-bit is pretty normal
these days in the embedded space, and 128-bit is being introduced at
the high end.

With a 64-bit bus you can read 4 16-bit values per cycle, every cycle.
This is clearly faster than reading 16 bits from 3 independent
addresses per cycle, right?

Maybe you could show a C snippet of what you do, then I'll show you an
equivalent one that doesn't need 3 memory accesses per cycle.
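[As a neutral illustration of the wide-load idea being argued here (not code from the thread): a FIR dot product that fetches four contiguous 16-bit values with one 64-bit load instead of four independent narrow accesses. The function name and sizes are invented, and the lane extraction assumes a little-endian host.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Dot product of n 16-bit coefficients and samples.  Instead of
 * independent 16-bit accesses, each iteration does two 64-bit loads
 * (coefficients and samples) and consumes 4 lanes from registers.
 * Lane extraction via shifts assumes little-endian memory layout. */
static int32_t fir_dot(const int16_t *coef, const int16_t *x, int n)
{
    int32_t acc = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        uint64_t cw, xw;
        memcpy(&cw, &coef[i], sizeof cw);  /* one wide load */
        memcpy(&xw, &x[i], sizeof xw);     /* one wide load */
        for (int k = 0; k < 4; k++) {
            int16_t c = (int16_t)(cw >> (16 * k));
            int16_t s = (int16_t)(xw >> (16 * k));
            acc += (int32_t)c * s;         /* 4 MACs per 2 loads */
        }
    }
    for (; i < n; i++)                     /* scalar tail */
        acc += (int32_t)coef[i] * x[i];
    return acc;
}
```

This only works when the narrow values are contiguous, which is the crux of the disagreement: streaming FIR data is contiguous, truly random 16-bit accesses are not.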
> The processor has to be able to
> do 4 simultaneous memory accesses every 10 nS, each being
> 16 bits wide and having its own address - and allow the DMAC
> to keep on buffering in yet another address area at the same time.
There is no need to do several independent accesses per cycle as long
as you've got enough bandwidth. 4 16-bit accesses every 10ns is only
800MBytes/s. The data bandwidth between the core and L1 alone is
4GBytes/s on a 500MHz ARM11, for example.
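[The arithmetic in the paragraph above can be checked directly. A throwaway helper, purely illustrative and not from the thread:]

```c
#include <assert.h>

/* Sustained bandwidth in MBytes/s for a stream of memory accesses:
 * width of each access in bytes, number of accesses per period, and
 * the period in nanoseconds. */
static double mbytes_per_sec(double width_bytes, double accesses,
                             double period_ns)
{
    return width_bytes * accesses * 1000.0 / period_ns;
}
```

Four 16-bit (2-byte) accesses every 10 ns gives 800 MBytes/s, while one 64-bit (8-byte) access per cycle at 500 MHz (2 ns period) gives 4000 MBytes/s, matching the figures quoted in the post.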
> I do not know the ARM11, but I considered an MPC5200 - 400 MHz
> PPC with DDR - and I estimated it would be far from sufficient.
> Can you say the ARM11 at 500 MHz has 3 times the power
> the 400 MHz PPC (603e core, 32-bit DDR 266) has (this is
> what I thought would be about enough to care to do further
> evaluations)?
In terms of external bus bandwidth they are the same: 32-bit DDR 266
gives 1GByte/s of bandwidth, like the ARM11. The core is a 2-way
out-of-order superscalar, similar in performance to an ARM11 (which is
not superscalar but is more modern). The PPC's L1-to-integer-core
bandwidth is half that of the ARM11, while its MAC performance is 8
times lower (1 16-bit MAC every 4 cycles)...

So low fixed point DSP performance is what kills it. It needs 400MHz
for 100M MACs. An ARM11 can do this in only 50MHz.
> There is no VLIW with the 54xx series DSPs, and the issue
> is not opcode cycles, the PPC can do floating point MAC
> in a single cycle, probably the ARM can the same etc.
Its FP unit has a single cycle FMAC just like the ARM11, indeed. Using
floats would double the bandwidth and MHz requirements, however. Both
CPUs could do 100M FMACs plus 300M 32-bit memory accesses at 200MHz.
If you can reuse values in registers then you can significantly reduce
the number of memory accesses; this is easy with FIR filters, for
example.
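[The register-reuse point can be sketched as follows (an illustration under assumptions, not code from the thread; names and sizes are invented): computing four adjacent FIR outputs in one pass lets each loaded coefficient feed four MACs, so memory traffic per MAC drops well below the naive one-output-at-a-time loop.]

```c
#include <assert.h>
#include <stdint.h>

/* Compute 4 adjacent FIR outputs y[0..3] over n taps.  Each
 * coefficient h[i] is loaded once and reused for all 4 accumulators,
 * and consecutive sample loads overlap between outputs, so a compiler
 * can keep most operands in registers.  x must have n+3 elements. */
static void fir_block4(const int16_t *x, const int16_t *h, int n,
                       int32_t y[4])
{
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (int i = 0; i < n; i++) {
        int32_t c = h[i];        /* one coefficient load ...          */
        a0 += c * x[i];          /* ... reused across 4 MACs          */
        a1 += c * x[i + 1];
        a2 += c * x[i + 2];
        a3 += c * x[i + 3];
    }
    y[0] = a0; y[1] = a1; y[2] = a2; y[3] = a3;
}
```

The naive loop needs 2 loads per MAC (coefficient plus sample); this blocked form needs roughly 0.5 loads per MAC, which is the kind of reuse the post alludes to.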
> The issue is memory bandwidth, not just sequential
> burst bandwidth, but random accesses to multiple areas.
> This is the major difference between a DSP and a general
> purpose processor.
Actually, if you do the sums you'll see that general purpose processors
have much higher bandwidth than the 5420. I guess you're using the
external bus bandwidth and comparing that against the DSP's internal
bandwidth. That's incorrect, as most (90+%) of the data movement
happens between the core and the L1 memory system.
> Well the 5420 has about 200 kilowords - 400 kilobytes - of
> memory and my application does use it all. I don't know if the
> ARM has that much and whether the program code will be
> as short to fit in - I know the PPC does not have it, it will
> need external memory.
It depends on which variant you use, but you can get an ARM11 with 32KB
I&D caches and a 128KByte L2 cache. As long as the working set fits in
the caches, the speed of external memory is irrelevant.

Wilco
> There is no need to do 3 independent accesses per cycle. This is a
> very inefficient way of increasing bandwidth and that is why modern
> CPUs increase the width of buses instead.
This tells me you have never actually done any DSP programming. Please correct me if I am wrong (I certainly mean no offence).
> 64-bit is pretty normal
> these days in the embedded space, and 128-bit is being introduced
> in the high-end.
Well, I have had a 64-bit wide PPC (8240/45) communicating with my DSP
based device and other things for over 5 years now...
> With a 64-bit bus you can read 4 16-bit values per cycle, every cycle.
> This is clearly faster than reading 16-bits from 3 independent address
> per cycle, right?
No. Every 16-bit value has a separate address, which is - in the case
of the 5420 - another 16 bits. I will not go into an explanation of why
this is so; I guess there are sufficient books on digital signal
processing around.
> Maybe you could show a C snippet of what you do, then I'll show you
> an equivalent one that doesn't need 3 memory accesses per cycle.
There is no C source to show. I used assembly - a precursor of the VPA
language I use nowadays on the PPC. I do not take seriously any
programming done in C at all (it could be argued I am wrong, but I
shall not enter such a discussion), and I can definitely tell you that
using C on a DSP is a waste of time and/or money.

Also, I will not go into further details on my device as it is still
unique on the market (my competitors are still doing the job of one of
my two cores in analog circuitry, and there are algorithms I use which
I know they have been keen to guess at for 5 years now...).
> There is no need to do several independent accesses per cycle as
> long as you've got enough bandwidth. 4 16-bit accesses every 10ns
> is only 800MBytes/s. Just the data bandwidth between the core and L1
> is 4GBytes/s on a 500Mhz ARM11 for example.
Here we go again - you don't want to believe DSPs have been designed as
they are out of necessity. OK, I'll try a general example. There is an
area - say, 4 kilobytes - with the coefficients, there is a circular
queue - say, 64k - with the incoming data, and there is another
circular queue - again, say, 64k - with the filtered results. All are
16 bits wide; you do 4k MACs per sample, each time starting one address
further on in the input queue, and write the result to the output
queue. Can you tell me how you do this without separate addresses
(especially on the ARM, where registers are so scarce)?

Finally, let me say this: there are many applications which need a
general purpose processor anyway, and with some more compute power and
memory bandwidth a DSP could be made unnecessary. Perhaps (probably?)
bandwidths will get high enough and lead to the extinction of DSPs as
we know them. However, there is a long way to go until this happens.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
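[Dimiter's general example can be written down as a minimal scalar C sketch (an editorial illustration, not code from the thread; the sizes match his description but the names and Q15 scaling are invented, and nothing here is DSP-optimized): a static coefficient table, a circular input queue and a circular output queue, with each step doing NTAPS MACs starting one position further on in the input queue.]

```c
#include <assert.h>
#include <stdint.h>

#define NTAPS 2048          /* 4 KBytes of 16-bit coefficients */
#define QMASK (32768 - 1)   /* 64 KByte circular queues of int16_t */

static int16_t coef[NTAPS];        /* coefficient table (static address) */
static int16_t inq[QMASK + 1];     /* circular input queue  */
static int16_t outq[QMASK + 1];    /* circular output queue */

/* One filter step: NTAPS MACs starting at in_pos, result written to
 * out_pos.  Note the three separate address streams per MAC iteration:
 * coefficient read, sample read, and (once per step) the result write.
 * A 64-bit accumulator avoids overflow over 2048 Q15 products. */
static void fir_step(unsigned in_pos, unsigned out_pos)
{
    int64_t acc = 0;
    for (unsigned i = 0; i < NTAPS; i++)
        acc += (int32_t)coef[i] * inq[(in_pos + i) & QMASK];
    outq[out_pos & QMASK] = (int16_t)(acc >> 15);  /* Q15 scaling */
}
```

A DSP walks all three streams with dedicated address generators in a single-cycle MAC loop; on a cached general purpose core the same loop relies on the coefficient and sample streams being sequential so wide loads and prefetch can cover the bandwidth, which is exactly the trade-off the two posters are debating.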
"Didi" <dp@tgi-sci.com> wrote in message 
news:1142132669.823913.252540@i40g2000cwc.googlegroups.com...
>> With a 64-bit bus you can read 4 16-bit values per cycle, every cycle.
>> This is clearly faster than reading 16-bits from 3 independent address
>> per cycle, right?
>
> No. Every 16 bit value has a separate address, which is - in the case
> of the 5420 - another 16 bits. I will not go into explanation why this
> is so, I guess there are sufficient books on digital signal processing
> around.
The usual way to handle this on a general purpose processor is to
unroll the loop and pack the loads. Can you explain why this is not
applicable in your case?

Steve.
