New ARM Cortex Microcontroller Product Family from STMicroelectronics
Started by ● June 18, 2007
Reply by ● June 24, 2007
On 23 Jun, 03:10, rickman <gnu...@gmail.com> wrote:
> On Jun 22, 7:34 pm, "Wilco Dijkstra" <Wilco_dot_Dijks...@ntlworld.com> wrote:
> > "rickman" <gnu...@gmail.com> wrote in message news:1182540547.184518.91830@o11g2000prd.googlegroups.com...
> >
> > > The data sheet says it requires one wait state from 24 to 48 MHz and 2
> > > wait states above 48 MHz. So compared to the Luminary parts running
> > > at 50 MHz with *NO* wait states, I say the ST M3 parts are dogs.
> >
> > It's not that bad. Cortex-M3 has a prefetch buffer and branch prediction. This
> > means that the cost of a single wait state can be hidden for conditional branches,
> > i.e. only indirect branches have a penalty. With 2 wait states the branch prediction
> > only works on unconditional branches, so you'll get a slowdown. However you can
> > change loops to use an unconditional branch at the end so they run at the speed
> > of zero-wait-state memory.
>
> I don't follow what you are saying at all. Branch prediction relates
> to pipelining. I don't see how it relates to wait states.

Adding a wait state is the same as increasing the pipeline depth, and
branch prediction coupled with prefetching can hide some of that latency.

> The required wait states are added because of a fundamental limitation in
> the bandwidth of the Flash memory. You can look ahead all you want,
> but you can still only return one word from Flash per 3 clock cycles
> when running at full speed. Unless the Flash word width is increased
> (as in the NXP designs) or the instruction size is reduced (many
> Cortex-M3 instructions are 16 bits, but they would need to be 10 bits
> with two wait states and 32-bit memory) this will limit performance in
> the Cortex-M3.
>
> Am I completely missing something? I always leave that possibility
> open...

You have to increase the fetch width if you add wait states, that's a
given. The M3 TRM recommends a 64-bit flash fetch. While this allows
straight-line code to run at full speed, branches are still slow.
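[The bandwidth arithmetic in this exchange can be made concrete with a small sketch. The numbers are illustrative, taken from the discussion itself, not from any datasheet: a flash that delivers one full-width line every (1 + wait states) cycles can sustain at most width/(1 + wait states) instruction bits per cycle, which also explains rickman's "10 bits" remark (32/3 ≈ 10.7).]

```python
def sustained_ips(flash_width_bits, wait_states, avg_insn_bits=16):
    """Upper bound on instructions fetched per CPU cycle, assuming the
    flash returns one full-width line every (1 + wait_states) cycles."""
    bits_per_cycle = flash_width_bits / (1 + wait_states)
    return bits_per_cycle / avg_insn_bits

# 32-bit flash, 2 wait states: 32/3 bits/cycle -> only ~0.67 of a
# 16-bit instruction per cycle, so the core starves on straight-line code.
print(round(sustained_ips(32, 2), 2))   # 0.67

# 64-bit flash, 2 wait states: 64/3 bits/cycle -> ~1.33 16-bit
# instructions per cycle, enough to keep straight-line code at full speed.
print(round(sustained_ips(64, 2), 2))   # 1.33
```

Branches still pay the wait-state penalty in this model, since a taken branch discards the prefetched line and waits for a fresh fetch.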
What I meant is that the M3 has branch optimizations that reduce this
slowdown.

> Yes, that is all pretty obvious. But it does not address the point of
> the Y intercept being a hefty 9 mA. This is not as high as the Analog
> Devices ARM parts, but it is significant. It means you need to use
> modes and hardware features to get better power savings compared to
> just slowing the clock, which is much simpler to do.

From what I remember, I think it is the PLL and flash that cause high
power consumption at lower speeds. The specs showed various settings
that use significantly less than 9 mA below 8 MHz. So I don't think it
really burns 9 mA at 0 MHz (which is what I think you mean by Y
intercept, right?).

> The power for the STM32 is from the data sheet and includes basic
> power to the peripherals, although since they are not performing work
> the power they draw is less than typical. So the comparison is not
> perfect.

The ST specs list some numbers with peripherals off, and that is less
than half the normal current, 21 mA from flash at 72 MHz IIRC - pretty
good.

> > Now consider that an M3 runs twice as fast as a SAM7 at the same frequency,
> > so the MIPS/Watt is 4 times as good!
>
> How do you support the claim that the M3 runs twice as fast as the
> SAM7 at the same frequency??? Maybe I don't want to know...

Because I've benchmarked it myself?

> I have not seen anyone claim that the M3 runs twice as fast as an ARM7
> clock for clock. I don't even think ARM claims that. I seem to

ARM claims a 70% improvement, but that is an understatement due to the
Dhrystone benchmark. EEMBC is more accurate here.

> recall that after all the hoopla is removed, you might see from 10% to
> 25% speedup from the ARM7 to the M3 depending on your application. If
> you disagree on this basic point, then I think we should not discuss
> it further.
> I have seen it discussed before ad nauseam with no hard
> information to support any given number.

You seem to forget that the M3 was designed from the ground up to be an
efficient MCU, while the ARM7 wasn't:

1. Thumb-2 is more efficient than Thumb-1 (just like ARM is faster than Thumb-1)
2. Cortex-M3 has a more efficient microarchitecture than ARM7 (it beats ARM9)
3. Cortex-M3 slows down less with wait states than ARM7
4. New features like unaligned access, hardware divide, better interrupts

Both 1 and 2 provide about 40% improvement, so about a 2 times speedup
together. 3 can give somewhere between 10 and 20% extra performance, but
only at higher frequencies when wait states are used. Depending on the
application, 4 can make a huge difference (I've seen benchmarks go twice
as fast just because of hardware divide). I've personally benchmarked
the effects of 1, 2 and 4 on a large number of benchmarks. Now where did
you get that 10-25% number from?

> > > Actually, I guess a power factor would be required for the SAM7 parts
> > > as well since they run with one wait state at their top speed. So
> > > maybe the STM32 parts do better on power than I realized!
> >
> > If you're trying to compare MIPS/Watt don't forget that different cores running
> > at the same frequency do not run at the same speed.
>
> Yes, but that is a small delta compared to adding wait states with a 2x
> or 3x reduction in performance and therefore the same effect on power
> efficiency.

Not at all. Adding wait states doesn't slow you down by that much as
long as you make the memory wider (not doing that makes no sense at all,
so I do not consider that a valid option). But instruction set and
microarchitecture differences can easily make a factor of 2 difference,
as shown above.

Wilco
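[The "Y intercept" argument earlier in the thread can be quantified with an affine current model I(f) = I0 + k·f. The i0 and slope values below are illustrative only, loosely fitted to the 9 mA intercept and 21 mA @ 72 MHz figures quoted in this discussion, not to any datasheet:]

```python
def current_mA(f_mhz, i0=9.0, slope=(21.0 - 9.0) / 72.0):
    """Affine current model: static draw i0 plus a per-MHz dynamic term.
    i0 and slope are hypothetical, fitted to the thread's 9 mA / 21 mA
    figures -- check the actual datasheet tables before relying on them."""
    return i0 + slope * f_mhz

def mips_per_mA(f_mhz, mips_per_mhz=1.0):
    # Assume throughput scales linearly with clock (1 MIPS/MHz here).
    return (mips_per_mhz * f_mhz) / current_mA(f_mhz)

# A large static intercept makes simple clock-slowing inefficient:
print(round(mips_per_mA(72), 2))  # ~3.43 MIPS/mA at full speed
print(round(mips_per_mA(8), 2))   # ~0.77 MIPS/mA at 8 MHz
```

This is why the earlier point about low-power modes matters: with a big intercept, efficiency per mA falls as you slow the clock, so you need to switch off the PLL/flash rather than just divide the clock.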
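[The compounding claim above ("Both 1 and 2 provide about 40% improvement, so about a 2 times speedup together") is just multiplication of independent speedup factors; the 1.15 figure below is the midpoint of the 10-20% wait-state advantage mentioned, used purely for illustration:]

```python
from math import prod

# Independent multiplicative speedups compound:
gains = [1.4, 1.4]            # Thumb-2 over Thumb-1, better microarchitecture
print(round(prod(gains), 2))  # 1.96 -- "about 2 times speedup together"

# Adding a midpoint 15% wait-state advantage at high clock rates:
print(round(prod([1.4, 1.4, 1.15]), 2))  # 2.25
```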
Reply by ● June 24, 2007
On 21 Jun, 10:00, Jim Granville <no.s...@designtools.maps.co.nz> wrote:
> Wilco Dijkstra wrote:
> > "Jim Granville" <no.s...@designtools.maps.co.nz> wrote in message news:4679cb90$1@clear.net.nz...
> >
> >> Wilco Dijkstra wrote:
> >
> >>>> Yes, MOVC is only for table lookup - In 80C51 there is no write to code memory, and no read code from ram, without
> >>>> special HW bridge steps.
> >
> >>> So you agree table lookup can read code memory. That means there is a
> >>> connection (at least a unidirectional bus) between code and data memories,
> >>> so it is not at all like the initial Harvards (which did not have such a connection).
> >
> >> Not quite, all it means is a connection between code and the accumulator. The std 80C51 has no Code-Data opcodes, and
> >> only one Code-ACC opcode.
> >
> > There are various ways of implementing such a bus, including reusing the
> > fetch bus if you only need read-only access. That doesn't make it any less of
> > a connection.
>
> You've lost me. You have claimed a "connection between code and data
> memories", and unless you somehow claim the accumulator is DATA memory,
> there is no such link. The 80C51 does not have a MOVC Data,@A+PC opcode;
> it only has the single MOVC A,@A+PC opcode. Code -> ACC. From there,
> you have to use other opcodes to move ACC to various Data memory
> choices. There are many more ACC -> Data Space opcodes.

Whether you can transfer data from code to data memory in a single
instruction is not relevant. You always need 2 accesses anyway (read
from code into a register, store the register into memory). Instruction
sets that can do this in a single instruction still need these 2 steps
and an intermediate buffer.

Put simply, the question is "Can you read data from code memory at a
random address?".

Wilco
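[The two-access point can be modelled in a few lines. This toy machine is entirely hypothetical - it is not 80C51 semantics, just an illustration of a modified Harvard whose only path out of code memory is into a register, so a code-to-data copy necessarily takes two steps:]

```python
class ToyModifiedHarvard:
    """Toy model with separate code and data spaces. The only path out
    of code memory is into a register (in the spirit of 80C51's
    MOVC A,@A+PC), so a code->data copy always takes two accesses."""
    def __init__(self, code):
        self.code = bytes(code)      # read-only code space
        self.data = bytearray(16)    # writable data space
        self.acc = 0                 # accumulator register

    def movc(self, addr):            # access 1: code memory -> register
        self.acc = self.code[addr]

    def mov_data(self, addr):        # access 2: register -> data memory
        self.data[addr] = self.acc

m = ToyModifiedHarvard(b"\x90\x01\x02")
m.movc(0)                            # two instructions to move one byte:
m.mov_data(5)                        # code[0] -> ACC -> data[5]
print(m.data[5])                     # 144 (0x90) -- a code byte read as data
```

A hypothetical single-instruction code-to-data move would still perform both accesses internally, through an intermediate buffer, which is the point being made above.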
Reply by ● June 24, 2007
On Sun, 24 Jun 2007 09:06:43 -0700, wilco.dijkstra@ntlworld.com wrote:
> Put simply, the question is "Can you read data from code memory at a
> random address?"

That's the question you'd ask to decide between a pure Harvard and a
modified Harvard? (The discussion makes me think so, but I'm not sure.)

Jon
Reply by ● June 24, 2007
wilco.dijkstra@ntlworld.com wrote:
> On 21 Jun, 10:00, Jim Granville <no.s...@designtools.maps.co.nz> wrote:
>
>> Wilco Dijkstra wrote:
>>
>>> "Jim Granville" <no.s...@designtools.maps.co.nz> wrote in message news:4679cb90$1@clear.net.nz...
>>
>>>> Wilco Dijkstra wrote:
>>
>>>>>> Yes, MOVC is only for table lookup - In 80C51 there is no write to code memory, and no read code from ram, without
>>>>>> special HW bridge steps.
>>
>>>>> So you agree table lookup can read code memory. That means there is a
>>>>> connection (at least a unidirectional bus) between code and data memories,
>>>>> so it is not at all like the initial Harvards (which did not have such a connection).
>>
>>>> Not quite, all it means is a connection between code and the accumulator. The std 80C51 has no Code-Data opcodes, and
>>>> only one Code-ACC opcode.
>>
>>> There are various ways of implementing such a bus, including reusing the
>>> fetch bus if you only need read-only access. That doesn't make it any less of
>>> a connection.
>>
>> You've lost me. You have claimed a "connection between code and data
>> memories", and unless you somehow claim the accumulator is DATA memory,
>> there is no such link. The 80C51 does not have a MOVC Data,@A+PC opcode;
>> it only has the single MOVC A,@A+PC opcode. Code -> ACC. From there,
>> you have to use other opcodes to move ACC to various Data memory
>> choices. There are many more ACC -> Data Space opcodes.
>
> Whether you can transfer data from code to data memory in a single
> instruction is not relevant. You always need 2 accesses anyway (read
> from code into a register, store the register into memory). Instruction
> sets that can do this in a single instruction still need these 2 steps
> and an intermediate buffer.
>
> Put simply, the question is "Can you read data from code memory at a
> random address?".

Oops - why is this changed from your previous assertion:

"That means there is a connection (at least a unidirectional bus)
between code and data memories."

If you are going into the fine ground of semantics (which I find rather
a waste of time, as end users frankly don't care) then you do need
to be consistent in your details ;)

i.e. an even better question is: can you buy silicon that fits in each
of your categories? If not, then what is the point?

-jg
Reply by ● June 25, 2007
> Put simply, the question is "Can you read data from code memory at a
> random address?".
>
> Wilco

Any Harvard architecture can read data into registers using the
immediate addressing mode.

  getdata: execute label[r0:d]
           ret

  label:
  label0:  ld r0,#0
  label1:  ld r0,#1
  label2:  ld r0,#2
  label3:  ld r0,#3
  label4:  ld r0,#4
  label5:  ld r0,#5
  label6:  ld r0,#6

This should work on a Harvard architecture without going to a "modified
Harvard", so the question is irrelevant for determining Harvard/not
Harvard.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may, or may not,
be shared by my employer Atmel Nordic AB
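[Ulf's trick - encoding the data as immediates inside the code and doing a computed jump into a table of load instructions - can be mimicked in a few lines. This sketch is purely illustrative (the names are mine, not from any real ISA): each table slot is a tiny "ld r0,#N" instruction, so the values live only as immediates and code memory is never read as data:]

```python
# Each "slot" models an ld r0,#n instruction: executing it puts the
# immediate into the register file.
def make_slot(n):
    def ld_r0(regs):
        regs["r0"] = n       # ld r0,#n
    return ld_r0

table = [make_slot(n) for n in range(7)]   # label0 .. label6

def getdata(regs, index):
    table[index](regs)       # computed jump into the table, then "ret"
    return regs["r0"]

regs = {"r0": 0}
print(getdata(regs, 4))      # 4 -- the value comes from *executing* code,
                             # not from reading code memory as data
```

Note `make_slot` captures each `n` in its own closure; this matters, since the naive `[lambda regs: regs.update(r0=n) for n in range(7)]` would bind every slot to the final `n`.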
Reply by ● June 25, 2007
> Oops - why is this changed from your previous assertion:
>
> "That means there is a connection (at least a unidirectional bus)
> between code and data memories."
>
> If you are going into the fine ground of semantics (which I find rather
> a waste of time, as end users frankly don't care) then you do need
> to be consistent in your details ;)

Wilco always modifies his statements, and forgets what he is saying....

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may, or may not,
be shared by my employer Atmel Nordic AB
Reply by ● June 26, 2007
On 24 Jun, 18:31, Jonathan Kirwan <jkir...@easystreet.com> wrote:
> On Sun, 24 Jun 2007 09:06:43 -0700, wilco.dijks...@ntlworld.com wrote:
>
> > Put simply, the question is "Can you read data from code memory at a
> > random address?"
>
> That's the question you'd ask to decide between a pure Harvard and a
> modified Harvard? (The discussion makes me think so, but I'm not
> sure.)

Yes, this question would be sufficient to decide between the early
Harvards and everything that came after them. However, the address space
question "Can you run code from data memory?" is a more important
distinction for today's Harvards.

What I am getting at is that just about all CPUs can read data from
code memory and so are not as pure as the early Harvards. As I'm sure
you are aware, at the time Von Neumann was seen as a major improvement
over Harvard, and the Harvards were quickly forgotten. It was much later
that modified Harvards were used for performance reasons.

Wilco
Reply by ● June 26, 2007
On 24 Jun, 22:06, Jim Granville <no.s...@designtools.maps.co.nz> wrote:
> wilco.dijks...@ntlworld.com wrote:
> > On 21 Jun, 10:00, Jim Granville <no.s...@designtools.maps.co.nz> wrote:
> >
> >> Wilco Dijkstra wrote:
> >
> >>> "Jim Granville" <no.s...@designtools.maps.co.nz> wrote in message news:4679cb90$1@clear.net.nz...
> >
> >>>> Wilco Dijkstra wrote:
> >
> >>>>>> Yes, MOVC is only for table lookup - In 80C51 there is no write to code memory, and no read code from ram, without
> >>>>>> special HW bridge steps.
> >
> >>>>> So you agree table lookup can read code memory. That means there is a
> >>>>> connection (at least a unidirectional bus) between code and data memories,
> >>>>> so it is not at all like the initial Harvards (which did not have such a connection).
> >
> >>>> Not quite, all it means is a connection between code and the accumulator. The std 80C51 has no Code-Data opcodes, and
> >>>> only one Code-ACC opcode.
> >
> >>> There are various ways of implementing such a bus, including reusing the
> >>> fetch bus if you only need read-only access. That doesn't make it any less of
> >>> a connection.
> >
> >> You've lost me. You have claimed a "connection between code and data
> >> memories", and unless you somehow claim the accumulator is DATA memory,
> >> there is no such link. The 80C51 does not have a MOVC Data,@A+PC opcode;
> >> it only has the single MOVC A,@A+PC opcode. Code -> ACC. From there,
> >> you have to use other opcodes to move ACC to various Data memory
> >> choices. There are many more ACC -> Data Space opcodes.
> >
> > Whether you can transfer data from code to data memory in a single
> > instruction is not relevant. You always need 2 accesses anyway (read
> > from code into a register, store the register into memory). Instruction
> > sets that can do this in a single instruction still need these 2 steps
> > and an intermediate buffer.
> >
> > Put simply, the question is "Can you read data from code memory at a
> > random address?".
>
> Oops - why is this changed from your previous assertion:
>
> "That means there is a connection (at least a unidirectional bus)
> between code and data memories."

If you can read data from code memory, there must be a connection via a
bus. If you have a bus, you can at least read and usually also write.
What difference do you see?

> i.e. an even better question is: can you buy silicon that fits in each
> of your categories? If not, then what is the point?

My point is precisely that the early Harvards are no longer in use.
I.e. current Harvards are not "true Harvards", so it makes no sense to
use that definition for the current CPUs.

Wilco
Reply by ● June 26, 2007
On 25 Jun, 17:12, "Ulf Samuelsson" <u...@a-t-m-e-l.com> wrote:
> > Put simply, the question is "Can you read data from code memory at a
> > random address?".
> >
> > Wilco
>
> Any Harvard architecture can read data into registers using the
> immediate addressing mode.
>
>   getdata: execute label[r0:d]
>            ret
>
>   label:
>   label0:  ld r0,#0
>   label1:  ld r0,#1
>   label2:  ld r0,#2
>   label3:  ld r0,#3
>   label4:  ld r0,#4
>   label5:  ld r0,#5
>   label6:  ld r0,#6
>
> This should work on a Harvard architecture without going to a "modified
> Harvard", so the question is irrelevant for determining Harvard/not
> Harvard.

So how does this read the *data* contained in the code memory? I'd like
to read the bit patterns of the instructions, not execute them.

Wilco
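[The distinction Wilco draws - executing instructions versus reading their bit patterns - has a handy analogy on any machine that exposes code as data. Python is used here purely as an illustration of the idea, not as a claim about any MCU: a function's compiled code can be both executed and inspected as raw bytes.]

```python
def answer():
    return 42

# Executing the code:
print(answer())                      # 42

# Reading the code's bit patterns as data -- the kind of access a
# MOVC-style read allows and an immediate-load table does not:
raw = answer.__code__.co_code        # the compiled bytecode, as bytes
print(type(raw).__name__)            # bytes
```

On a pure early Harvard there is simply no path that yields `raw`: the instruction store can feed the decoder but never a register, which is exactly the dividing line being argued over in this thread.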