New ARM Cortex Microcontroller Product Family from STMicroelectronics

Reply by Bill Giovino ●June 21, 2007

>"Jim Granville" wrote...
> One clarify : How credible is your ST contact ? - as I cannot see "share
> the same packet buffer" anywhere in the user manual, or drawings, and
> nothing suggests that conflict.
> Search cannot find 'USB' inside the CAN chapter, nor 'CAN' inside the
> USB chapter ?
> It also does not make chip-design sense, surely it is harder to
> lock/overlap some block resource like that, in this cut/paste world ?
>
> That said, it is a strange thing to 'make up'/ admit to if untrue, so
> perhaps it is coming via the errata pipeline ?
>
> -jg

My contacts are very credible. And, unfortunately, what we are discussing is something
that would not be made obvious in datasheet diagrams. Datasheet diagrams are meant to
provide an overview of functionality in as clear a visual format as possible. Like the
model of the atom, it almost never reflects 100% what's inside the chip.

Chip buffers are RAM, and RAM is very greedy when it comes to die area. Allowing for an
extra buffer could price out the chip as non-competitive, or it could be unwieldy from a
layout POV.

My GUESS - the chip was designed for a primary customer, who wanted either USB or CAN
(but not both). During chip design, a smart marketing person asked about adding the
extra peripheral. Adding the CAN or USB adds small pieces of a penny to the chip cost.
But adding an extra buffer probably priced the chip beyond what was quoted to the target
customer.

Jim, if you want to take this off-line, I can be reached at the first email address on
this page:
http://www.microcontroller.com/Embedded.asp?did=23

-Bill.

Reply by Bill Giovino ●June 21, 2007

"Jim Granville" wrote...
> I'm with Jon on this, the semantics matter little; but it would be
> good to answer the simple question "Can it execute code from RAM?"
> - in some systems, that is useful.

In the case of my original article:
http://www.microcontroller.com/news/arm_cortex_stm.asp
the answer is YES - the ST part can execute code from RAM off the data (system) bus. Of
course, there will be extra cycles.

Back when I was but a fledgling FAE, in my presentations I used to label architectures
as Harvard, Modified Harvard, Von Neumann, etc.

In PPT presentations, engineers would ALWAYS debate amongst themselves as to the
differences between these architectures, and sometimes whether or not Von Newman (What,
Me Worry???) was spelled right...

What you decide to call the architecture is much less important than what you decide to
do with it.

Bill

Reply by Jim Granville ●June 21, 2007

Bill Giovino wrote:

> "Jim Granville" wrote...
> 
>>I'm with Jon on this, the semantics matter little; but it would be
>>good to answer the simple question "Can it execute code from RAM?"
>>- in some systems, that is useful.
> 
> 
> In the case of my original article:
> http://www.microcontroller.com/news/arm_cortex_stm.asp
> the answer is YES - the ST part can execute code from RAM off the data (system) bus. Of
> course, there will be extra cycles.

In some parts, RAM CODE execution is promoted for speed (due to slower 
FLASH speeds).
Is that not the case in the ST device Core/RAM/FLASH combination ?

-jg

Reply by Bill Giovino ●June 22, 2007

"Jim Granville" wrote...
> Bill Giovino wrote:
>
> > "Jim Granville" wrote...
> >
> >>I'm with Jon on this, the semantics matter little; but it would be
> >>good to answer the simple question "Can it execute code from RAM?"
> >>- in some systems, that is useful.
> >
> >
> > In the case of my original article:
> > http://www.microcontroller.com/news/arm_cortex_stm.asp
> > the answer is YES - the ST part can execute code from RAM off the data (system) bus. Of
> > course, there will be extra cycles.
>
> In some parts, RAM CODE execution is promoted for speed (due to slower
> FLASH speeds).
> Is that not the case in the ST device Core/RAM/FLASH combination ?
>
> -jg

Good question... in the ST part, if you are running out of Flash at zero wait states,
then you are getting simultaneous fetches from both instruction & data buses - taking full
advantage of the Harvard (ahem!) architecture gets you *mostly* single-cycle execution.

But for the same example, if you are running out of RAM, then you are using the same bus
for instructions and data, you lose the advantages of the Harvard architecture and so
it's slower.

However, if you are running out of Flash with the CPU at a higher speed than the Flash,
and so the Flash requires wait states while taking advantage of the Harvard
architecture, then compared to running instructions & data out of RAM off the data
bus (where there are extra cycles), the answer as to which is faster is - it depends...

Reply by Eric ●June 22, 2007

On Jun 22, 12:40 am, "Bill Giovino" <conta...@microcontroller.com>
wrote:

> However, if you are running out of Flash with the CPU at a higher speed than the Flash,
> and so the Flash requires wait states while taking advantage of the Harvard
> architecture

Any idea if the ST Cortex M3 can run without wait states from flash at
their rated speed? That would be quite impressive.

Eric

Reply by rickman ●June 22, 2007

On Jun 22, 2:34 pm, Eric <englere_...@yahoo.com> wrote:
> On Jun 22, 12:40 am, "Bill Giovino" <conta...@microcontroller.com>
> wrote:
>
> > However, if you are running out of Flash with the CPU at a higher speed than the Flash,
> > and so the Flash requires wait states while taking advantage of the Harvard
> > architecture
>
> Any idea if the ST Cortex M3 can run without wait states from flash at
> their rated speed? That would be quite impressive.

The data sheet says it requires one wait state from 24 to 48 MHz and 2
wait states above 48 MHz.  So compared to the Luminary parts running
at 50 MHz with *NO* wait states, I say the ST M3 parts are dogs.
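The data sheet rule quoted above can be written as a small lookup. This is only a sketch in Python: the function name is made up for illustration, and placing exactly 24 MHz and 48 MHz in the lower band is an assumption, since the post does not say which side the boundaries fall on.

```python
def flash_wait_states(sysclk_mhz: float) -> int:
    """Flash wait states for a core clock, per the figures quoted in the post.

    Assumption: exactly 24 MHz and exactly 48 MHz belong to the lower band.
    """
    if sysclk_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    if sysclk_mhz <= 24:
        return 0
    if sysclk_mhz <= 48:
        return 1
    return 2


for mhz in (8, 24, 36, 48, 50, 72):
    print(f"{mhz:>3} MHz -> {flash_wait_states(mhz)} wait state(s)")
```

By this rule a 50 MHz clock, the speed quoted for the Luminary parts, would already need 2 wait states on the ST part.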

The power consumption is not great either, at least not compared to
parts like Atmel SAM7. The advertisement says it gets "0.5 mA/MHz in
RUN mode from Flash", but this is not very accurate.  The power curve
does not have a 0.5 mA/MHz slope.  The STM32F103 data sheet shows
higher current per MHz at low clock speeds with a Y intercept of about
9 mA.  I think the lower mA/MHz at higher clock speeds reflects the
lower MIPS available due to the required wait states.  Accounting for
that, the mA/MHz ranges from 0.54 at 24 MHz to 0.88 at 72 MHz.  I
think this may be better than the Luminary Stellaris parts, but not as
good as the Atmel SAM7 parts which are claimed to be a true 0.5 mA/MHz
with very low static current in the uA range.  I have not looked at
the newer Luminary parts in detail.

Actually, I guess a power factor would be required for the SAM7 parts
as well since they run with one wait state at their top speed.  So
maybe the STM32 part do better on power than I realized!

I am still waiting for Luminary to announce parts on a smaller
geometry process.  I was told they would be out toward the end of the
year in a 130 nm process, IIRC.  These parts should be very low power,
but I don't know if they will keep 5 volt tolerance and what the
static current will be.

Reply by JeffM ●June 22, 2007

>Bill Giovino wrote
>>http://www.microcontroller.com/news/arm_cortex_stm.asp
>>STMicroelectronics has introduced the new STM32 microcontroller family,
>>based on the Harvard architecture ARM Cortex.
>>
FreeRTOS.org wrote:
>Where have you been all this time?  ;o)
>http://groups.google.com/group/comp.arch.embedded/browse_thread/thread/528fb9dd63e29756/a16733f4109c7f42?lnk=gst&q=%22ST+announce+their+Cortex-M3+micros%22&rnum=1#a16733f4109c7f42

Note to Richard:
When posting Google Groups links, the browse_frm paradigm is nicer.
http://groups.google.com/group/comp.arch.embedded/browse_frm/thread/528fb9dd63e29756/a16733f4109c7f42?q=announce.their.Cortex.M3.micros

In addition, using periods (or hyphens[1]) to form phrases
makes things more searchable (no  %22ST  stuff)
...and  lnk=gst&  is just noise.
.
.
[1] A hyphen (grease-monkey)
will find e.g BOTH **grease monkey** AND **greasemonkey**.

Reply by Wilco Dijkstra ●June 22, 2007

"rickman" <gnuarm@gmail.com> wrote in message news:1182540547.184518.91830@o11g2000prd.googlegroups.com...
> On Jun 22, 2:34 pm, Eric <englere_...@yahoo.com> wrote:
>> On Jun 22, 12:40 am, "Bill Giovino" <conta...@microcontroller.com>
>> wrote:
>>
>> > However, if you are running out of Flash with the CPU at a higher speed than the Flash,
>> > and so the Flash requires wait states while taking advantage of the Harvard
>> > architecture
>>
>> Any idea if the ST Cortex M3 can run without wait states from flash at
>> their rated speed? That would be quite impressive.
>
> The data sheet says it requires one wait state from 24 to 48 MHz and 2
> wait states above 48 MHz.  So compared to the Luminary parts running
> at 50 MHz with *NO* wait states, I say the ST M3 parts are dogs.

It's not that bad. Cortex-M3 has a prefetch buffer and branch prediction. This
means that the cost of a single waitstate can be hidden for conditional branches,
ie. only indirect branches have a penalty. With 2 wait states the branch prediction
only works on unconditional branches, so you'll get a slowdown. However you can
change loops to use an unconditional branch at the end so they run at the speed
of zero-wait state memory.

> The power consumption is not great either, at least not compared to
> parts like Atmel SAM7. The advertisement says it gets "0.5 mA/MHz in
> RUN mode from Flash", but this is not very accurate.  The power curve
> does not have a 0.5 mA/MHz slope.  The STM32F103 data sheet shows
> higher current per MHz at low clock speeds with a Y intercept of about
> 9 mA.  I think the lower mA/MHz at higher clock speeds reflects the
> lower MIPS available due to the required wait states.

It is the flash power consumption. When you add wait states the flash power
consumption per clock drops to 50% (1 wait state) or 33% (2 wait states). Ie.
the flash has identical power consumption at 24, 48 and 72MHz.
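The arithmetic behind that statement can be sketched as follows, assuming each flash read occupies (1 + wait states) core clocks and that flash power scales with the number of reads per second. Both are simplifying assumptions, and the function name is illustrative only.

```python
def flash_reads_per_us(sysclk_mhz: float, wait_states: int) -> float:
    """Back-to-back flash reads per microsecond.

    Assumption: one read occupies (1 + wait_states) core clock cycles.
    """
    return sysclk_mhz / (1 + wait_states)


# Operating points at the top of each wait-state band quoted in the thread.
operating_points = [(24, 0), (48, 1), (72, 2)]

for mhz, ws in operating_points:
    reads = flash_reads_per_us(mhz, ws)
    share = 1 / (1 + ws)  # fraction of a read completed per core clock
    print(f"{mhz} MHz, {ws} WS: {reads:.0f} reads/us, {share:.0%} of a read per clock")
```

All three operating points come out at the same read rate, which is why the flash contribution stays flat while the core clock triples.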

Of course the secondary effect of adding wait states is the core slows down
and so uses less power. Based on their numbers I estimate the slowdown is
between 10 and 15% - not too bad for 2 wait states.

> Accounting for
> that, the mA/MHz ranges from 0.54 at 24 MHz to 0.88 at 72 MHz.  I
> think this may be better than the Luminary Stellaris parts, but not as
> good as the Atmel SAM7 parts which are claimed to be a true 0.5 mA/MHz
> with very low static current in the uA range.  I have not looked at
> the newer Luminary parts in detail.

I calculate 40mA at 72MHz, so 0.56mA/MHz. Not quite 0.5, but close.
But I don't see where you get the idea they are worse than SAM7. I'm not sure
what part you were comparing with, but the SAM7A3 (also CAN and USB like
STM32F103) shows 70mA at 60MHz, or more than twice at the same frequency.

Now consider that an M3 runs twice as fast as a SAM7 at the same frequency,
so the MIPS/Watt is 4 times as good!
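Spelled out, the figures above work through as in the sketch below. It takes the 2x clock-for-clock speed factor as claimed in this post rather than as a measured value, and it assumes both parts run from the same supply voltage so that a current ratio stands in for a power ratio.

```python
# Figures quoted in the post.
stm32_ma, stm32_mhz = 40.0, 72.0     # STM32F103 estimate
sam7a3_ma, sam7a3_mhz = 70.0, 60.0   # SAM7A3 figure

stm32_ma_per_mhz = stm32_ma / stm32_mhz
sam7a3_ma_per_mhz = sam7a3_ma / sam7a3_mhz

# Current drawn by the SAM7A3 relative to the STM32 at the same frequency.
current_ratio = sam7a3_ma_per_mhz / stm32_ma_per_mhz

# Assumption: claimed M3 vs SAM7 speed at the same clock (not a measurement).
claimed_speed_ratio = 2.0

# Assumption: same supply voltage, so mA ratios equal power ratios.
mips_per_watt_ratio = current_ratio * claimed_speed_ratio

print(f"STM32F103: {stm32_ma_per_mhz:.2f} mA/MHz")
print(f"SAM7A3:    {sam7a3_ma_per_mhz:.2f} mA/MHz")
print(f"Current ratio at equal clock: {current_ratio:.1f}x")
print(f"MIPS/Watt ratio with claimed speedup: {mips_per_watt_ratio:.1f}x")
```

The result depends directly on the speed factor: with a smaller clock-for-clock speedup the MIPS/Watt advantage shrinks in proportion.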

> Actually, I guess a power factor would be required for the SAM7 parts
> as well since they run with one wait state at their top speed.  So
> maybe the STM32 part do better on power than I realized!

If you're trying to compare MIPS/Watt don't forget that different cores running
at the same frequency do not run at the same speed.

Wilco

Reply by rickman ●June 22, 2007

On Jun 22, 7:34 pm, "Wilco Dijkstra" <Wilco_dot_Dijks...@ntlworld.com>
wrote:
> "rickman" <gnu...@gmail.com> wrote in message news:1182540547.184518.91830@o11g2000prd.googlegroups.com...
> > The data sheet says it requires one wait state from 24 to 48 MHz and 2
> > wait states above 48 MHz.  So compared to the Luminary parts running
> > at 50 MHz with *NO* wait states, I say the ST M3 parts are dogs.
>
> It's not that bad. Cortex-M3 has a prefetch buffer and branch prediction. This
> means that the cost of a single waitstate can be hidden for conditional branches,
> ie. only indirect branches have a penalty. With 2 wait states the branch prediction
> only works on unconditional branches, so you'll get a slowdown. However you can
> change loops to use an unconditional branch at the end so they run at the speed
> of zero-wait state memory.

I don't follow what you are saying at all.  Branch prediction relates
to pipelining.  I don't see how it relates to wait states.  The
required wait states are added because of a fundamental limitation in
the bandwidth of the Flash memory.  You can look-ahead all you want,
but you can still only return one word from Flash per 3 clock cycles
when running at full speed.  Unless the Flash word width is increased
(as in the NXP designs) or the instruction size is reduced (many
Cortex M3 instructions are 16 bits, but they would need to be 10 bits
with two wait states and 32 bit memory) this will limit performance in
the Cortex M3.
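The bandwidth argument can be checked with a few lines of Python. This sketch assumes, as the post does, that one flash read returns a single word of the given width and takes (1 + wait states) clocks. The 64 and 128 bit widths are illustrative values for a wider flash, not figures taken from any data sheet.

```python
def fetch_bits_per_clock(flash_width_bits: int, wait_states: int) -> float:
    """Average instruction bits delivered per core clock.

    Assumption: each read returns one flash word and occupies
    (1 + wait_states) core clock cycles, with no overlap between reads.
    """
    return flash_width_bits / (1 + wait_states)


for width in (32, 64, 128):  # 64 and 128 are hypothetical wider flash words
    for ws in (0, 1, 2):
        bits = fetch_bits_per_clock(width, ws)
        print(f"{width:>3}-bit flash, {ws} WS: {bits:5.1f} bits/clock")
```

With 32 bit flash and 2 wait states this gives about 10.7 bits per clock, which is where the "10 bits" instruction size in the post comes from. Widening the flash word is the only knob in this model that brings the figure back above 16 or 32 bits per clock.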

Am I completely missing something?  I always leave that possibility
open...

> > The power consumption is not great either, at least not compared to
> > parts like Atmel SAM7. The advertisement says it gets "0.5 mA/MHz in
> > RUN mode from Flash", but this is not very accurate.  The power curve
> > does not have a 0.5 mA/MHz slope.  The STM32F103 data sheet shows
> > higher current per MHz at low clock speeds with a Y intercept of about
> > 9 mA.  I think the lower mA/MHz at higher clock speeds reflects the
> > lower MIPS available due to the required wait states.
>
> It is the flash power consumption. When you add wait states the power
> consumption flash drops to 50% (1 wait state) or 33% (2 wait states). Ie.
> the flash has identical power consumption at 24, 48 and 72MHz.
>
> Of course the secondary effect of adding wait states is the core slows down
> and so uses less power. Based on their numbers I estimate the slowdown is
> between 10 and 15% - not too bad for 2 wait states.

Yes, that is all pretty obvious.  But it does not address the point of
the Y intercept being a hefty 9 mA.  This is not as high as the Analog
Devices ARM parts, but it is significant.  It means you need to use
modes and hardware features to get better power savings compared to
just slowing the clock which is much simpler to do.
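A simple linear model shows why the intercept matters. In this sketch only the 9 mA intercept comes from the post; the 0.43 mA/MHz slope is a hypothetical value picked so that 72 MHz lands near the 40 mA figure mentioned elsewhere in the thread, and a real power curve is not exactly linear.

```python
STATIC_MA = 9.0          # Y intercept quoted in the post
SLOPE_MA_PER_MHZ = 0.43  # hypothetical slope, for illustration only


def supply_current_ma(f_mhz: float) -> float:
    """Supply current under an assumed linear model: intercept + slope * f."""
    return STATIC_MA + SLOPE_MA_PER_MHZ * f_mhz


def effective_ma_per_mhz(f_mhz: float) -> float:
    """Total current divided by clock, i.e. the cost of each MHz of work."""
    return supply_current_ma(f_mhz) / f_mhz


for mhz in (72, 36, 8, 1):
    print(f"{mhz:>2} MHz: {supply_current_ma(mhz):5.2f} mA total, "
          f"{effective_ma_per_mhz(mhz):5.2f} mA/MHz")
```

Under these assumptions the cost per MHz climbs steeply as the clock drops, because the fixed 9 mA is spread over less work. That is the reason sleep modes are needed rather than clock scaling alone.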

> > Accounting for
> > that, the mA/MHz ranges from 0.54 at 24 MHz to 0.88 at 72 MHz.  I
> > think this may be better than the Luminary Stellaris parts, but not as
> > good as the Atmel SAM7 parts which are claimed to be a true 0.5 mA/MHz
> > with very low static current in the uA range.  I have not looked at
> > the newer Luminary parts in detail.
>
> I calculate 40mA at 72MHz, so 0.56mA/MHz. Not quite 0.5, but close.
> But I don't see where you get the idea they are worse than SAM7. I'm not sure
> what part you were comparing with, but the SAM7A3 (also CAN and USB like
> STM32F103) shows 70mA at 60MHz, or more than twice at the same frequency.

The SAM7A3 is one of the oldest SAM7 parts and is not a useful basis
for comparison.  Personally, I do not expect to have a use for the CAN
controller and I don't expect it was running when the power
measurements were made.  I was using the SAM7S parts as a point of
comparison.  I have a spread sheet that was provided by Atmel which
shows the power rating of the CPU since you can control all the
various power consuming sections.  Ignoring the peripherals, the CPU
(with PLL running) consumes 0.5 mA/MHz with a very small Y intercept
(as I initially said).

The power for the STM32 is from the data sheet and includes basic
power to the peripherals, although since they are not performing work
the power they draw is less than typical.  So the comparison is not
perfect.

> Now consider that an M3 runs twice as fast as a SAM7 at the same frequency,
> so the MIPS/Watt is 4 times as good!

How do you support the claim that the M3 runs twice as fast as the
SAM7 at the same frequency???  Maybe I don't want to know...

I have not seen anyone claim that the M3 runs twice as fast as an ARM7
clock for clock.  I don't even think ARM claims that.  I seem to
recall that after all the hoopla is removed, you might see from 10% to
25% speedup from the ARM7 to the M3 depending on your application.  If
you disagree on this basic point, then I think we should not discuss
it further. I have seen it discussed before ad nauseam with no hard
information to support any given number.

> > Actually, I guess a power factor would be required for the SAM7 parts
> > as well since they run with one wait state at their top speed.  So
> > maybe the STM32 part do better on power than I realized!
>
> If you're trying to compare MIPS/Watt don't forget that different cores running
> at the same frequency do not run at the same speed.

Yes, but that is a small delta compared to adding waitstates with a 2x
or 3x reduction in performance and therefore the same effect on power
efficiency.

Reply by Bill Giovino ●June 23, 2007

"Wilco Dijkstra" wrote...
>
> "rickman" wrote...
> > On Jun 22, 2:34 pm, Eric wrote:
> >> On Jun 22, 12:40 am, "Bill Giovino" wrote:
> >>
> >> > However, if you are running out of Flash with the CPU at a higher speed than the Flash,
> >> > and so the Flash requires wait states while taking advantage of the Harvard
> >> > architecture
> >>
> >> Any idea if the ST Cortex M3 can run without wait states from flash at
> >> their rated speed? That would be quite impressive.
> >
> > The data sheet says it requires one wait state from 24 to 48 MHz and 2
> > wait states above 48 MHz.  So compared to the Luminary parts running
> > at 50 MHz with *NO* wait states, I say the ST M3 parts are dogs.
>
> It's not that bad. Cortex-M3 has a prefetch buffer and branch prediction. This
> means that the cost of a single waitstate can be hidden for conditional branches,
> ie. only indirect branches have a penalty. With 2 wait states the branch prediction
> only works on unconditional branches, so you'll get a slowdown. However you can
> change loops to use an unconditional branch at the end so they run at the speed
> of zero-wait state memory.

Completely correct. But you must remember that devices like these are often not
used at their full speed.

ST certainly has excellent embedded Flash processes that can run faster than 24MHz and
they deliberately chose not to use any of them for this product. In the case of this
device, it looks like it was developed specifically for low power applications, where the
issue isn't really instructions per second, but milliamps per second. The intelligent
peripherals, and especially the non-intrusive DMA, allow developers to run the core
slower.

When competing with commodity devices, (and anything licensed from ARM has become a
commodity), a microcontroller company needs a competitive advantage. ST's advantage is
their superior in-house process technology. Only TI (who also licenses the ARM Cortex)
competes with ST when it comes to superior in-house process technology, and, hey, ST and
TI are so close in process ability I wouldn't bet on the difference between the two.

Bill Giovino
http://Microcontroller.com
http://www.microcontroller.com/news/arm_cortex_stm.asp