EmbeddedRelated.com

Compare ARM MCU Vendors

Started by Dave Graffio September 1, 2010
2010-09-21 13:16, rickman skrev:
> On Sep 20, 4:30 am, Ulf Samuelsson<u...@a-t-m-e-l.com> wrote:
>> David Brown skrev:
>>> A better solution for micros like that is a wider flash design with an
>>> sram buffer in the flash module - that is certainly how some
>>> manufacturers handle the problem. It is a simpler solution than a full
>>> instruction cache because you have only a single "tag" (or perhaps two,
>>> if you have two such buffers), and there are no issues with coherence or
>>> anything else. The buffer of perhaps 256 bytes gets filled whenever you
>>> access a new "page" in the flash, so that the processor then reads from
>>> the buffer rather than directly from the flash. And if space/economics
>>> allow, you can have a wider flash-to-buffer bus to keep up a high
>>> bandwidth even with slow flash and a fast processor.
>>
>> The disadvantage of having a 256 byte wide memory is power consumption.
>> You will have 2048 active sense amplifiers.
>> I don't see that coming soon.
>
> I hope you aren't involved in architecting new MCU designs. I don't
> think anyone said they wanted 2048 sense amplifiers. I would either
> interpret the above to be "256 bits" or I would consider an
> implementation that used a 256 byte cache of some sort. What would be
> the utility of a 256 byte wide interface to the Flash? Even the
> fastest CM3 CPUs can't run at nearly that speed.
>
> Rick
I am certainly involved in the definition of new MCU designs, although mostly by providing ideas.

He said that he wanted a 256 byte buffer, and I really doubt that this should be interpreted as bits.

He only said that the buffer will be filled when you access a new page, and did not state how many cycles it would take. From a performance point of view, it makes more sense to load it in one cycle. If you start loading using sequential accesses to the flash, you will probably waste both cycles and power.

The proposal is already implemented in page-mode DRAMs, so it may make sense at first glance, unless you know more about flash internals.

--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may (or may not)
be shared by my employer Atmel Nordic AB
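[Editor's note: the cycles-versus-power tradeoff being argued here can be put in rough numbers. The sketch below is a back-of-envelope model with made-up unit costs, not datasheet figures from any real part; its only point is that both fill strategies read the same total number of bits.]

```python
# Back-of-envelope model: filling a 256-byte page buffer in one very wide
# flash access versus many sequential narrower accesses. The per-bit energy
# cost is an illustrative assumption (one active sense amplifier per bit).

BUFFER_BYTES = 256

def fill_cost(bus_bits, cycles_per_access=1, energy_per_bit_read=1.0):
    """Return (cycles, relative energy) to fill the whole buffer.

    Simplification: every bit read costs the same energy regardless of
    bus width, so total energy is identical either way; what changes is
    latency and peak power.
    """
    accesses = (BUFFER_BYTES * 8) // bus_bits
    cycles = accesses * cycles_per_access
    energy = BUFFER_BYTES * 8 * energy_per_bit_read
    return cycles, energy

# One-cycle fill needs a 2048-bit read port (2048 sense amps active at once):
print(fill_cost(2048))   # (1, 2048.0)
# A 64-bit bus takes 32 sequential accesses for the same buffer:
print(fill_cost(64))     # (32, 2048.0)
```

Under this naive model the total energy is the same; the real argument in the thread is about peak power (all sense amplifiers at once) and about bits fetched that are never executed when the code branches away early.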
2010-09-21 13:09, rickman skrev:
> On Sep 3, 12:08 am, An Schwob in the USA<schwo...@aol.com> wrote:
>> On Sep 2, 7:37 pm, "Dave Graffio"<wscra...@yahoo.com> wrote:
>>
>>> "antedeluvian" wrote...
>>>> A really great part is the Cypress PSOC5 which gives a great deal of
>>>> flexibility because of its configurability.
>>>>
>>>> Unfortunately it appears to be made of pure unobtanium.
>>>
>>> Not true. I've heard it's being designed by the engineering firm of
>>> Tuttle and Dunsel. (Capt, Retired)
>>
>> Dave,
>>
>> you heard strange things, such as Luminary (TI) being low quality. They
>> manufacture on one of the highest-quality production lines in the
>> world, TSMC. Marvell does not design and manufacture MCUs; they do
>> high-end application processors, no flash but lots of MHz. Atmel started
>> strong with ARM7 and ARM9 but is weak in Cortex-M3; their focus
>> shifted very much towards AVR32. NXP offers the fastest Cortex-M3 with
>> flash. Btw, did you know that Toshiba has the fastest M3 running from
>> internal SDRAM? Did you know that Energy Micro achieves better power
>> numbers using the Cortex-M3 than any other vendor, even those using
>> Cortex-M0?
Except for the AT32UC3L, which has better power consumption than the Energy Micro part in a pure CPU comparison. I didn't study the power consumption of the EFM32 peripherals, but I know it has a UART which can run at low frequency. The UC3L has the "SleepWalking" feature, which will turn power to the peripherals on and off using the event system, rather than waking up the CPU to do this. The EFM32 is very limited in flash size, and almost all customers I talk to want a fairly large amount of flash.
> Not sure what you mean by "Atmel is weak in Cortex-M3". The CM3 is
> new enough that not everyone has their products out yet. I think
> Atmel dilly dallied too long with the CM3, but I expect this was due
> to company goal issues and not because of "weakness" of any kind.
I have no clue about the decision criteria, but I have always had the opinion that as long as only ST has the CM3, Atmel does not need it. If/when NXP and others go for it, then Atmel needs it as well, but there is time to catch up. While this strategy will lose some designs, I know that if I sum up all the "big" designs lost by being a tad late, this is less than half the volume of a single project where additional focus on the SAM7 enabled Atmel to win a design which will move to CM3 once Atmel has that available. The parallel strategy of having both 8-bit and 32-bit AVRs has enabled Atmel to enter the mobile phone market, which people watching NASDAQ have noticed this year.
> They have a competing 32 bit MCU product and I expect they could only
> throw so many resources at bringing out a totally new MCU line. Give
> them a few more months and I think they will not disappoint.
There are two groups: AVR (8 and 32 bit) is handled by one group, and the ARM products are handled by another. You will see competition for resources between Cortex-M3 chips and ARM9 chips, but not between Cortex-M3 and 32-bit AVR chips.
> "Fastest" is always a short-lived title. Clock speed is seldom a
> determining criterion in selecting an MCU and I expect it is often
> given too much weight by engineers when initially winnowing their MCU
> choices. It is a simple number that is easy to verify. CPU speed is
> a much more complex measurement that is very hard to verify for your
> application, but this is the one that may actually make a difference
> in your design.
>
>> PSoC5 is a great product and if your volume production does not start
>> before 2011, you might want to order a FirstTouch for PSoC 5: just $49,
>> free tools, several sensors for acceleration, temperature, capacitive
>> touch, and readily available. Got one on my desk, like it.
>> http://www.cypress.com/psoc5 is a good place to start.
>
> How can you plan to use a part, even if you can wait six months for
> production, if you don't know the price? Has anyone heard a number
> for production pricing on the PSOC5?
>
>> I could write a lot more about ARM / Cortex MCUs because that's what I
>> have been dealing with since the first ARM7 MCUs hit the market. If
>> you need professional help with the selection, write an email to
>> microcontroller (skip this at gmail) -dod comm
>> It would go a long way if you would list your requirements; you get
>> better answers.
>>
>> For a list with many articles about Cortex-based MCUs check out this
>> one: http://mcu-related.com/architectures/35-cortex-m3
>
> Some three or four years ago I put together a list of available ARM7
> devices. By the time Luminary came on the scene it got to be too
> much work to update. Now with all the CMx devices out there it would
> be a major effort to keep this updated. Does anyone have a
> comprehensive comparison of features and capabilities of the CMx MCUs
> available?
>
> Rick
--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may (or may not)
be shared by my employer Atmel Nordic AB
On 23/09/2010 08:30, Ulf Samuelsson wrote:
> 2010-09-21 13:16, rickman skrev:
> [snip]
>
> I am certainly involved in the definition of new MCU designs,
> although mostly by providing ideas.
>
> He said that he wanted a 256 byte buffer, and I really doubt
> that this should be interpreted as bits.
>
> He only said that the buffer will be filled when you
> accessed a new page, and did not state how many cycles it would take.
> From a performance point of view, it makes more sense to load it in one
> cycle. If you start loading using sequential accesses to the flash,
> you will probably waste both cycles and power.
From the performance viewpoint, loading in a single cycle would be ideal - but from the space and power viewpoint that would be a bad idea. So loading sequentially with a medium-width bus (I suggested 64 bit) is likely to be the best compromise.
> The proposal is already implemented in page mode DRAMs,
> so it may make sense at first, unless you know more about flash internals.
I know enough about flash internals to know it is a useful idea, and could be a cheap, simple and low-power method to improve flash access speeds. I know enough about chip design and logic design to know that decoupling the flash access and control logic from the processor's memory bus will simplify some of the logic, and reduce the levels of combinational logic that must be completed within a clock cycle. It also allows the processor and the flash module to run at independent speeds.

I also know that it would complicate other parts of the design, and the extra unnecessary flash reads may outweigh the flash reads spared.

In effect, my suggestion is a cache front-end to the flash with just one line, but a large line width and perhaps two-way associativity. The ideal balance may be different - half the line width and four-way associativity might be better. It's all a balancing act.

I also know that I don't know nearly enough detail to judge whether the sums will add up to making this a good idea in practice. It depends on so many factors such as flash design (some incur extra delays when switching pages), access times, power requirements of the different parts, access patterns on the instruction bus, area costs, design times and design costs, etc., and I don't know anything about these.

I am also fairly sure that the designers who /are/ capable of calculating and balancing these tradeoffs will have thought of doing something like this. There are certainly similar solutions used on many high-speed flash microcontrollers, though they may be much smaller. It could well be that my suggested 256 byte buffer is far too big, and that an 8 or 16 byte buffer is fine when your CPU clock speed is not too much higher than the flash access speed.
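[Editor's note: the single-line "page buffer" described here is simple enough to simulate. The sketch below is an illustrative model with assumed parameters (a 256-byte line, synthetic address traces), not the behaviour of any shipping flash controller; it shows both why the idea works for sequential code and why a second way of associativity helps.]

```python
# Minimal model of a one-line flash page buffer: a single tag, a 256-byte
# line, refilled whenever the CPU touches a flash page other than the one
# currently buffered.

class PageBuffer:
    def __init__(self, line_bytes=256):
        self.line_bytes = line_bytes
        self.tag = None        # page currently held; None = empty
        self.hits = 0          # fetches served from the SRAM buffer
        self.fills = 0         # each fill = one full page read from flash

    def read(self, addr):
        page = addr // self.line_bytes
        if page == self.tag:
            self.hits += 1     # no flash access needed
        else:
            self.fills += 1    # refill the buffer from flash
            self.tag = page

# Sequential code hits almost every time (256 word fetches, 4 page fills):
buf = PageBuffer()
for addr in range(0, 1024, 4):
    buf.read(addr)
print(buf.hits, buf.fills)     # 252 4

# A tight loop straddling a page boundary thrashes the single line -
# the case a second buffer (two-way associativity) would fix:
buf = PageBuffer()
for _ in range(10):
    buf.read(0)
    buf.read(300)              # different page: every access refills
print(buf.hits, buf.fills)     # 0 20
```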
2010-09-23 09:52, David Brown skrev:
> On 23/09/2010 08:30, Ulf Samuelsson wrote:
> [snip]
>
> I know enough about flash internals to know it is a useful idea, and
> could be a cheap, simple and low-power method to improve flash access
> speeds.
> [snip]
>
> I am also fairly sure that the designers who /are/ capable of
> calculating and balancing these tradeoffs will have thought of doing
> something like this. There are certainly similar solutions used on many
> high-speed flash microcontrollers, though they may be much smaller. It
> could well be that my suggested 256 byte buffer is far too big, and that
> an 8 or 16 byte buffer is fine when your cpu clock speed is not too much
> higher than the flash access speed.
I think that the way this is implemented is through an instruction queue. This was implemented in early 32-bit chips, like the NS32016 and the MC68010. The MC68010 even allowed you to loop in the queue.

It is not implemented on the ARM, and I do not think that it exists in the Cortex-M3 either. The AVR32 does have a queue and will fetch instructions faster than it will execute them, and this is one reason why the AVR32 can handle wait states better than the Cortex-M3.

On the AVR32 you lose about 7% due to the wait state on the first access, and you only need one wait state at 66 MHz, the top speed of current production parts.

You will not get a 100% hit rate, so your boost will be less than 7%. If you do add SRAM, you might be better off adding a branch-target cache to get rid of the initial wait states. Once you start running sequential fetches, the wide memory will give you a benefit, but even a 128-bit flash can be a hog on power.

The SAM7 with a 32-bit flash is faster than an LPC2xxx with 128-bit flash at the same frequency when running Thumb mode, and it draws much less current. The faster flash makes all the difference. The LPC2xxx can offset this with a slightly higher clock rate, but that will not make power consumption better.

--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may (or may not)
be shared by my employer Atmel Nordic AB
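[Editor's note: the ~7% figure quoted above is consistent with simple arithmetic: if a prefetch queue hides the wait state on sequential fetches, only the first fetch after a branch pays it. The branch frequency below is an assumed value chosen to illustrate the calculation, not a measured AVR32 number.]

```python
# Sketch of the wait-state arithmetic. Assumption: sequential fetches are
# pipelined through the instruction queue and hide the flash wait state,
# so only the fetch that refills the queue after a branch pays it.

def slowdown(wait_states, branch_fraction):
    """Fractional cycle loss relative to a zero-wait-state part."""
    base = 1.0                                        # cycles per fetch, ideal
    penalized = base + wait_states * branch_fraction  # amortized penalty
    return penalized / base - 1.0

# 1 wait state, a queue refill roughly every 14th instruction:
print(f"{slowdown(1, 1/14):.1%}")   # 7.1%
```

A branch-target cache attacks exactly the `branch_fraction` term, which is why it is suggested as the better use of extra SRAM.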
On Sep 23, 2:30 am, Ulf Samuelsson <nospam....@atmel.com> wrote:
> 2010-09-21 13:16, rickman skrev:
> [snip]
>>> The disadvantage of having a 256 byte wide memory is power consumption.
>>> You will have 2048 active sense amplifiers.
>>> I don't see that coming soon.
> [snip]
>
> He only said that the buffer will be filled when you
> accessed a new page, and did not state how many cycles it would take.
> From a performance point of view, it makes more sense to load it in one
> cycle. If you start loading using sequential accesses to the flash,
> you will probably waste both cycles and power.
>
> The proposal is already implemented in page mode DRAMs,
> so it may make sense at first, unless you know more about flash internals.
>
> --
> Best Regards
> Ulf Samuelsson
> These are my own personal opinions, which may (or may not)
> be shared by my employer Atmel Nordic AB
I wish I had a nickel for every time someone said bytes when they meant bits or the other way around... especially when it was me!

I'm not really following you. You say using 2048 sense amps is power hungry, and then you say loading it in sequential accesses will waste power. You can't have it both ways; one is worse than the other, unless you are saying each is equally bad. The difference is that using 2048 sense amplifiers pulls the data out of the flash some huge factor faster than the CPU can use it! So it has pretty much no upside to match the downside.

BTW, the power consumption is not because of using 2048 sense amplifiers. The power consumption comes from making the reads. So if the CPU only needed the flash to make new reads proportionally less often, the power consumption might not be much, if any, higher than if it were read out sequentially. That, however, is a big IF.

My point is that there are very many tradeoffs and very many solutions. Only a few have worked out in practice given the fundamentals of IC design. As processing makes more and more transistors cheaper and cheaper, the tradeoffs shift to different solutions. So there is no one answer, and yesterday's bad idea can be tomorrow's great idea. But we only have to concern ourselves with today.

Rick
On Sep 23, 12:52 pm, Ulf Samuelsson <nospam....@atmel.com> wrote:
> 2010-09-23 09:52, David Brown skrev:
> [snip]
>
> I think that the way this is implemented is through an instruction
> queue. This was implemented in early 32 bit chips, like the NS32016
> and the MC68010. The MC68010 even allowed you to loop in the queue.
>
> It is not implemented on the ARM, and I do not think that it
> exists in the Cortex-M3 either. The AVR32 does have a queue
> and will fetch instructions faster than it will execute them,
> and this is one reason why the AVR32 can handle wait states
> better than the Cortex-M3.
>
> On the AVR32 you lose about 7% due to the wait state on the first access,
> and you only need one wait state at 66 MHz, the top speed of current
> production parts.
>
> You will not get a 100% hit rate, so your boost will be less than 7%.
> If you do add SRAM, you might be better off adding a branch-target
> cache to get rid of the initial wait states.
> Once you start running sequential fetches, the wide memory will
> give you a benefit, but even a 128 bit flash can be a hog on power.
>
> The SAM7 with a 32 bit flash is faster than an LPC2xxx with 128 bit
> flash at the same frequency when running Thumb mode,
> and it draws much less current.
> The faster flash makes all the difference.
> The LPC2xxx can offset this with a slightly higher clock rate,
> but that will not make power consumption better.
So many IFs, so little time. Benchmarking is an art, not a science. Best to run your app and see what is faster for your app. Rick
2010-09-23 23:15, rickman skrev:
> On Sep 23, 2:30 am, Ulf Samuelsson<nospam....@atmel.com> wrote:
> [snip]
>
> I wish I had a nickel for every time someone said bytes when they
> meant bits or the other way around... especially when it was me!
Very few people speak of bits when sizing buffers.
> I'm not following you really. You say using 2048 sense amps is power
> hungry and then you say loading it in sequential accesses will waste
> power.
If you jump to a position in a flash page, and the next instruction is a jump to another flash page, then if you have a 2048-bit flash, you certainly waste power. If you have a 64-bit flash which starts reading until it has 256 bytes of cache, then again you waste power.

If you jump forward within the page, do you read all the intermediate values? Then you waste power and performance. If you don't read the intermediates, then you have to skip parts of the buffer, or move to a real cache with valid bits.

Reading a word at a time does not waste power, since you read only as much as you need. The drawback is that you do not have fast sequential access.

The "locality" of instructions is important. The likelihood that the CPU will execute 2, 3, 4, ..., n instructions in a sequence gives you the ideal buffer size. The ideal buffer size is of course application dependent.

> You can't have it both ways, one is worse than the other
> unless you are saying each is equally bad. The difference is that
> using 2048 sense amplifiers pulls the data out of the flash some huge
> factor faster than the CPU can use it! So it has pretty much no
> upside to match the downside.
> BTW, the power consumption is not because of using 2048 sense
> amplifiers. The power consumption comes from making the reads. So if
> the CPU only needed the Flash to make new reads proportionally less
> often, the power consumption might not be much if any higher than if
> it were read out sequentially. That however, is a big IF.
>
> My point is that there are very many tradeoffs and very many
> solutions. Only a few have worked out in practice given the
> fundamentals of IC design. As the processing makes more and more
> transistors cheaper and cheaper, the tradeoffs shift to different
> solutions. So there is no one answer and yesterday's bad idea can be
> tomorrow's great idea. But we only have to concern ourselves with
> today.
>
> Rick
--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may (or may not)
be shared by my employer Atmel Nordic AB
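[Editor's note: the locality argument above - that the typical sequential run length between branches sets the ideal buffer size - can be sketched numerically. The run-length distribution below is invented purely for illustration; real numbers would come from profiling the application, which is exactly why the ideal size is application dependent.]

```python
# For a given buffer size, estimate how many prefetched bytes are fetched
# but never executed before the next branch. Assumed: 2-byte (Thumb-like)
# instructions and an invented distribution of sequential run lengths.

def wasted_bytes(buffer_bytes, run_lengths, insn_bytes=2):
    """Average bytes per run that were loaded into the buffer but unused."""
    waste = 0
    for run in run_lengths:
        used = min(run * insn_bytes, buffer_bytes)
        waste += buffer_bytes - used
    return waste / len(run_lengths)

# Instructions executed between branches (illustrative, not profiled data):
runs = [4, 8, 8, 16, 32, 6, 10, 12]
for size in (16, 64, 256):
    print(size, wasted_bytes(size, runs))
```

With short runs, a big buffer mostly loads bytes that are thrown away on the next branch - the quantitative form of the "waste both cycles and power" objection.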
2010-09-23 23:17, rickman skrev:
> On Sep 23, 12:52 pm, Ulf Samuelsson<nospam....@atmel.com> wrote:
> [snip]
>> The SAM7 with a 32 bit flash is faster than an LPC2xxx with 128 bit
>> flash, at the same frequency when running Thumb Mode,
>> and it draws much less current.
>> The faster flash makes all the difference.
>> The LPC2xxxx can offset this with a slightly higher clock rate, >> but that will not make power consumption better. > > So many IFs, so little time. Benchmarking is an art, not a science. > Best to run your app and see what is faster for your app. > > Rick
That is, if speed is the parameter you are looking for! Many applications need a certain speed, but once that is reached, they will not use any additional performance.

On the ARM7 you have a basic trade-off between speed and code size, but with waitstates the lower memory use of the Thumb instruction set can make it faster than the ARM instruction set.

-- Best Regards Ulf Samuelsson These are my own personal opinions, which may (or may not) be shared by my employer Atmel Nordic AB
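Ulf's point about Thumb overtaking ARM once waitstates appear can be sketched with a rough cycle estimate. This is a back-of-envelope model, not a measurement: the 1.3x Thumb instruction-count expansion and the one-instruction-per-cycle core are illustrative assumptions.

```python
# Back-of-envelope model of ARM vs. Thumb fetching from 32-bit flash.
# Assumptions (illustrative, not measured): Thumb needs ~1.3x the
# instruction count of ARM for the same work, and the core retires at
# most one instruction per cycle.

ARM_BYTES, THUMB_BYTES = 4, 2      # instruction widths in bytes
BUS_BYTES = 4                      # 32-bit flash interface
THUMB_INSN_RATIO = 1.3             # assumed code-expansion factor for Thumb

def cycles(n_arm_insns, waitstates, insn_bytes, insn_ratio=1.0):
    """Estimate total cycles: each instruction costs the larger of one
    execute cycle and its amortized share of a flash fetch."""
    n = n_arm_insns * insn_ratio
    fetch_per_insn = (1 + waitstates) * insn_bytes / BUS_BYTES
    return n * max(1.0, fetch_per_insn)

for ws in (0, 1, 2):
    arm = cycles(1000, ws, ARM_BYTES)
    thumb = cycles(1000, ws, THUMB_BYTES, THUMB_INSN_RATIO)
    print(f"{ws} waitstate(s): ARM={arm:.0f}  Thumb={thumb:.0f}")
```

With zero waitstates ARM wins on instruction count, but from one waitstate up the 32-bit flash delivers two Thumb instructions per access and Thumb pulls ahead, matching the behaviour described above.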
On 24/09/2010 01:48, Ulf Samuelsson wrote:
> 2010-09-23 23:17, rickman skrev: >> On Sep 23, 12:52 pm, Ulf Samuelsson<nospam....@atmel.com> wrote: >>> 2010-09-23 09:52, David Brown skrev: >>> >>> >>> >>>> On 23/09/2010 08:30, Ulf Samuelsson wrote: >>>>> 2010-09-21 13:16, rickman skrev: >>>>>> On Sep 20, 4:30 am, Ulf Samuelsson<u...@a-t-m-e-l.com> wrote: >>>>>>> David Brown skrev: >>>>>>>> A better solution for micros like that is a wider flash design >>>>>>>> with an >>>>>>>> sram buffer in the flash module - that is certainly how some >>>>>>>> manufacturers handle the problem. It is a simpler solution than >>>>>>>> a full >>>>>>>> instruction cache because you have only a single "tag" (or perhaps >>>>>>>> two, >>>>>>>> if you have two such buffers), and there are no issues with >>>>>>>> coherence or >>>>>>>> anything else. The buffer of perhaps 256 bytes gets filled whenever >>>>>>>> you >>>>>>>> access a new "page" in the flash, so that the processor then reads >>>>>>>> from >>>>>>>> the buffer rather than directly from the flash. And if >>>>>>>> space/economics >>>>>>>> allow, you have have a wider flash-to-buffer bus to keep up a high >>>>>>>> bandwidth even with slow flash and a fast processor. >>> >>>>>>> The disadvantage of having a 256 byte wide memory, is power >>>>>>> consumption. >>>>>>> You will have 2048 active sense amplifiers. >>>>>>> I dont see that coming soon. >>> >>>>>> I hope you aren't involved in architecting new MCU designs. I don't >>>>>> think anyone said they wanted 2048 sense amplifiers. I would either >>>>>> interpret the above to be "256 bits" or I would consider an >>>>>> implementation that used a 256 byte cache of some sort. What would be >>>>>> the utility of a 256 byte wide interface to the Flash? Even the >>>>>> fastest CM3 CPUs can't run at nearly that speed. >>> >>>>>> Rick >>> >>>>> I am certainly involved in the definition of new MCU designs, >>>>> altough mostly by providing ideas. 
>>> >>>>> He said that he wanted a 256 byte buffer, and i really doubt >>>>> that this should be interpreted as bits. >>> >>>>> He only said that the buffer will be filled when you >>>>> accessed a new page, and did not state how many cycles it would take. >>>>> From performance point of view, it makes more sense to load it in one >>>>> cycle. If you start loading using sequential accesses to the flash, >>>>> you will probably waste both cycles and power. >>> >>>> From the performance viewpoint, loading in a single cycle would be >>>> ideal - but from the space and power viewpoint that would be a bad >>>> idea. >>>> So loading sequentially with a medium-width bus (I suggested 64 bit) is >>>> likely to be the best compromise. >>> >>>>> The proposal is already implemented in page mode DRAMs, >>>>> so it may make sense at first, unless you know more about flash >>>>> internals. >>> >>>> I know enough about flash internals to know it is a useful idea, and >>>> could be a cheap, simple and low-power method to improve flash access >>>> speeds. I know enough about chip design and logic design to know that >>>> de-coupling the flash access and control logic from the processor's >>>> memory bus will simplify some of the logic, and reduce the levels of >>>> combination logic that must be completed within a clock cycle. It also >>>> allows the processor and the flash module to run at independent speeds. >>> >>>> I also know that it would complicate other parts of the design, and the >>>> extra unnecessary flash reads may outweigh the flash reads spared. >>> >>>> In effect, my suggestion is a cache front-end to the flash with just >>>> one >>>> line, but a large line width and perhaps two-way associativity. The >>>> ideal balance may be different - half the line width and four-way >>>> associativity might be better. It's all a balancing act. 
>>> >>>> I also know that I don't know nearly enough detail to judge whether the >>>> sums will add up to making this a good idea in practice. It depends on >>>> so many factors such as flash design (some incur extra delays when >>>> switching pages), access times, power requirements of the different >>>> parts, access patterns on the instruction bus, area costs, design times >>>> and design costs, etc., and I don't know anything about these. >>> >>>> I am also fairly sure that the designers who /are/ capable of >>>> calculating and balancing these tradeoffs will have thought of doing >>>> something like this. There are certainly similar solutions used on many >>>> high-speed flash microcontrollers, though they may be much smaller. It >>>> could well be that my suggested 256 byte buffer is far too big, and >>>> that >>>> an 8 or 16 byte buffer is fine when your cpu clock speed is not too >>>> much >>>> higher than the flash access speed. >>> >>> I think that the way this is implemented is through an instruction >>> queue. This was implemented in early 32 bit chips, like the NS32016 >>> and the MC68010. The MC68010 even allowed you to loop in the queue. >>> >>> It is not implemented on the ARM, and I do not think that it >>> exists in the Cortex-M3 as well. The AVR32 does have a queue >>> and will fetch instructions faster that it will execute, >>> and this is one reason why the AVR32 can handle waitstates >>> better than the Cortex-m3. >>> >>> On the AVR32 you lose about 7% due to the waitstate on the first access, >>> and you only need one waitstate at 66 MHz, the top speed of current >>> production parts. >>> >>> You will not get 100% hitrate, so your boost will be less than 7%. >>> If you do add SRAM, you might be better off adding a branch-target >>> cache to get rid of the initial waitstates. >>> Once you start running sequential fetch the wide memory will >>> give you a benefit but even a 128 bit flash can be a hog on power. 
>>> >>> The SAM7 with a 32 bit flash is faster than an LPC2xxx with 128 bit >>> flash, at the same frequency when running Thumb Mode, >>> and it draws much less current. >>> The faster flash makes all the difference. >>> The LPC2xxxx can offset this with a slightly higher clock rate, >>> but that will not make power consumption better. >> >> So many IFs, so little time. Benchmarking is an art, not a science. >> Best to run your app and see what is faster for your app. >> >> Rick > > If fast is the parameter you are looking for! > Many applications need a certain speed, but once it is there, > it will not use additional performance. > > You have a basic selection between speed and code size on the ARM7, > but with waitstates the lower memory use of the Thumb instruction set > can make it faster than the ARM instruction set. >
I think it is interesting to look at the history of instruction sets. Long ago, there were two competing ideas: CISC designs with highly varied instruction encodings (typically in 8-bit parts), and RISC designs that were consistent and wide (typically 32-bit). It turns out that both extremes were "wrong" - the most efficient modern instruction sets for small devices are 16 bits wide for most instructions, with some 32-bit (or 48-bit) instructions for flexibility. Consistency and orthogonality of the architecture are important, but should not be taken to extremes. There is a lot to like about the Thumb2 set - I think it's a big improvement on the original ARM ISA.

Of course, the 68000 designers at Motorola figured this out about 30 years ago...
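The mixed 16/32-bit encoding praised here is concrete in Thumb-2: the width of each instruction is decided by the top five bits of its first halfword. A minimal sketch of that decode rule (the example encodings are standard Thumb):

```python
# How a Thumb-2 decoder tells 16-bit from 32-bit encodings.  Per the
# ARM architecture manual, a halfword whose top five bits are 0b11101,
# 0b11110 or 0b11111 is the first half of a 32-bit instruction;
# anything else is a complete 16-bit instruction.

def thumb2_insn_size(first_halfword):
    """Return the size in bytes of the Thumb-2 instruction that
    starts with the given 16-bit halfword."""
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2

print(thumb2_insn_size(0x2001))  # MOVS r0, #1          -> 2
print(thumb2_insn_size(0xF000))  # first half of BL     -> 4
print(thumb2_insn_size(0x4770))  # BX lr                -> 2
```

This is what lets most instructions stay 16 bits wide while the occasional 32-bit form provides the flexibility mentioned above.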

David Brown wrote:
> > I think it is interesting to look at the history of instruction sets. > Long ago, there were two competing ideas - there were CISC instruction > sets with very varied instruction sets (typically in 8-bit parts), and > RISC which were all consistent and wide (typically 32-bit). It turns > out that both extremes were "wrong", and the most efficient modern > instruction sets for small devices are 16-bit wide for most > instructions, with some 32-bit (or 48-bit) for flexibility. Consistency > and orthogonality of the architecture is important, but should not be > taken to extremes. There is a lot to like about the Thumb2 set - I > think it's a big improvement on the original ARM ISA.
There is a lot I like about the Thumb 2 ISA. I have worked on ISA design on several commercial processors.

The M68K (that you mention and I clipped), patterned after the PDP-11, is the classical orthogonal instruction set. It takes a lot more than that to make an efficient processor. The TI 9900, a contemporary of the 68K development with similar roots, was less effective at executing applications. The difference between the 68K and the 9900 was essentially data flow inside the processor. The 9900 was easier to program in many ways, BUT it relied on more indirect accesses to data and was significantly less efficient. Clean data flow between executing instructions is as important as the instructions themselves. The classic example of how to kill a processor is to force memory management through the primary accumulator(s). This killed several processors in the '90s.

RISC can be very efficient but requires a different approach to code generation. The XGATE is a simple 16-bit RISC that, driven by a well-designed code generator, will compete with well-designed CISC processors. Our application-based benchmarks showed that the difference was about 10%.

There is a whole area of instruction design that trades compile-time complexity for processor simplicity or timing. Many of the most successful ISAs make very good use of redundant instructions. This has been done four ways:

1) Conceptually having a page-0 space, where some RAM locations are more valuable because access to them is quicker and requires less generated code.

2) Memory-to-memory operations that don't require intervening register involvement.

3) Instructions with implied arguments - for example INC, DEC, and complement.

4) Mapping registers (real and virtual) onto RAM space, which reduces register-specific instructions; an extreme example is the move machines with one instruction.

Regards, w.. -- Walter Banks Byte Craft Limited http://www.bytecraft.com
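Walter's point 4 in the extreme - the one-instruction "move machine" - can be illustrated with a toy simulator. The addresses and the single memory-mapped adder below are invented for illustration; real transport-triggered designs differ, but the principle is the same: the only instruction is MOVE, and computation happens as a side effect of moving data into function-unit addresses.

```python
# Toy "move machine": the only instruction is MOVE src -> dst, and all
# computation is a side effect of moving data into memory-mapped
# function-unit addresses.  The unit layout is hypothetical.

ADD_A, ADD_TRIG, ADD_OUT = 0x10, 0x11, 0x12   # adder: operand, trigger, result

def run(program, mem):
    for src, dst in program:           # every instruction is a MOVE
        mem[dst] = mem[src]
        if dst == ADD_TRIG:            # writing the trigger fires the adder
            mem[ADD_OUT] = mem[ADD_A] + mem[ADD_TRIG]
    return mem

mem = {0: 5, 1: 7, ADD_A: 0, ADD_TRIG: 0, ADD_OUT: 0}
# Compute mem[2] = mem[0] + mem[1] using nothing but MOVEs:
prog = [(0, ADD_A), (1, ADD_TRIG), (ADD_OUT, 2)]
print(run(prog, mem)[2])   # -> 12
```

With register-specific opcodes eliminated entirely, all the encoding space goes to addressing - the trade-off between instruction-count and instruction-variety pushed to its limit.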
