
Atmel releasing FLASH AVR32 ?

Started by -jg March 19, 2007
"Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> skrev i meddelandet 
news:C_vMh.3351$5c2.346@newsfe3-win.ntli.net...
> > "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message > news:ettfd5$836$1@aioe.org... > >> spi_task(unsigned char *mbox); >> { >> while(1); >> data = 0; >> waitfor(!CS); >> for(i = 0; i < 8; i++) { >> waitfor(SCK); >> data = (data << 1) | (MOSI); >> waitfor(!SCK); >> } >> send(mbox,data); >> waitfor(CS); >> } >> } >> >> You have 40 tasks, each running a S/W slave SPI. >> All SPIs have to run at the *same* frequency but are otherwise >> independent.. >> I.E: It may be the case that all clocks toggle at exactly >> the same time or no clock toggles at the same time as another clock. >> >> All SPIs must be handled concurrently. >> >> >> What is the maximum fixed frequency you can accept, >> with or without multithreading. > > Using polling in both cases would result in about the same max frequency. > Assuming all ports run at the same frequency and are active then amount of > code that needs to execute to receive 40 8-bit values is the same, whether > multithreaded or not. If not all ports are active then multithreading has > much > lower CPU utilization (as only a few threads are running). >
Let's see: to execute
>> data = (data << 1) | (MOSI);
>> waitfor(!SCK);
we can assume the following assembler code:

    lsld    1,r0
    load    mosi,r1
    or      r1,r0
    waitfor eventflag_1   ; YES ; H/W to support event wait

So the multithreaded CPU will complete in 40 x 4 = 160 instructions.
I'd like to see a single threaded CPU doing this in 160 instructions.
I think an interrupt is probably 5-10 clocks and return from interrupt
the same. So 10 clocks * 40 interrupts = 400 clocks to start with.
I think you will run about 5 times slower. With more overhead for
interrupts, much much slower.
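[Editor's note: for reference, a cleaned-up, compilable version of the slave-SPI
task quoted earlier might look like the sketch below. The waitfor_high/waitfor_low,
read_pin and send primitives are hypothetical stand-ins for the event-wait, GPIO
and mailbox facilities of the multithreaded core under discussion; they are not a
real API.]

    #include <stdint.h>

    /* Hypothetical primitives -- stand-ins for the event-wait, GPIO and
     * mailbox hardware of the multithreaded core being discussed. */
    extern void    waitfor_high(int pin);           /* block thread until pin == 1 */
    extern void    waitfor_low(int pin);            /* block thread until pin == 0 */
    extern int     read_pin(int pin);               /* sample a GPIO input         */
    extern void    send(void *mbox, uint8_t byte);  /* post a byte to a mailbox    */

    enum { PIN_CS = 0, PIN_SCK = 1, PIN_MOSI = 2 };

    /* One software SPI slave: sample MOSI on the rising edge of SCK, shift
     * MSB first, deliver a byte to the mailbox after 8 bits.  Forty instances
     * of this task would run in forty hardware threads. */
    void spi_task(void *mbox)
    {
        for (;;) {
            waitfor_low(PIN_CS);                /* wait for chip select        */
            uint8_t data = 0;
            for (int i = 0; i < 8; i++) {
                waitfor_high(PIN_SCK);          /* wait for clock rising edge  */
                data = (uint8_t)((data << 1) | read_pin(PIN_MOSI));
                waitfor_low(PIN_SCK);           /* wait for clock falling edge */
            }
            send(mbox, data);
            waitfor_high(PIN_CS);               /* wait for end of transfer    */
        }
    }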
> Using interrupts in both cases would result in about the same max
> frequency.
> The maximum frequency is lower compared to polling (due to the interrupt
> latency overhead - twice as slow is possible in a worst case scenario).
> Multithreading will have a similar interrupt latency as taking an
> interrupt is virtually identical to starting a new thread (some CPUs even
> switch to a different set of registers). The advantage of using interrupts
> is that CPU utilization is much lower if only a few SPI ports are active.
>
No, because a proper multithreaded architecture releases the pipeline to the
computable threads whenever a thread does not need to be active.
> Peripherals typically have some buffering to reduce interrupt rate so the
> overhead is minimal (this is a little extra hardware, far less than
> hardware multithreading needs). Therefore the advantage of polling when
> all devices are active is pretty small. So there is little difference
> between multithreaded polling and non-multithreaded interrupts.
>
> If you're claiming that polling has lower CPU utilization in a
> multithreaded environment then I agree. If you're claiming that interrupts
> have a large overhead if you do very little work per interrupt (ie. no
> buffering), then I agree.
>
> But I still don't see any advantage inherent to multithreading.
If you do not need top performance in a single thread, you can greatly simplify
the pipeline and thus increase the frequency of the CPU.

You are also able to mix programs from several sources on a single CPU instead
of having several CPUs, because no one knows how to maintain code from
different sources. Sometimes you don't even get the source of the firmware.

A classic example would be something implementing a V.22 modem in S/W.
You can have the V.22 S/W running in a thread, and you cannot screw up the
performance of the modem S/W. By allocating a certain number of MIPS and
guaranteeing that the program is not stopped by application S/W running at
high priority, you have solved the problem.

Another example: today you can get a single chip GPS. To reduce cost, they are
ROMmed, and you add an external microcontroller to do the user interface.
At this stage, the ARM7 CPU running the GPS S/W needs about 20 MIPS, and there
is no plan to let anyone touch the ARM, due to the sensitivity of the S/W.
With a multithreaded CPU you could allocate 20 MIPS for the GPS and run the
application S/W on the remaining MIPS.
> Wilco
--
Best Regards,
Ulf Samuelsson

This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:etub66$5ab$1@aioe.org...
> "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> skrev i meddelandet > news:C2wMh.3352$5c2.86@newsfe3-win.ntli.net...
>> Starting a thread on an event is just as complex as handling an
>> interrupt.
> No, you start a thread containing a loop
> and in the beginning of the loop, you wait for an event.
Yes, that is possible if you have enough hardware threads. If you do then you don't need traditional interrupts at all anymore and only use polling. I can see that would simplify hardware and software (I'm sure you agree polling is easier). However few cores support more than 2 hardware threads as having many large register files is a waste of die area and a cycle time limiter. I can imagine keeping contexts in memory, but that makes switching threads expensive, defeating the advantage.
> Once that event occurs, the thread becomes computable
> and you can read data on the next CPU cycle.
That would be nice, but the reality is that it will take some time before you
start executing another thread. You always have the event synchronization time
and the thread startup time. You avoid save/restore like in a traditional
interrupt, but you still have all the other overheads.

While it is possible to reduce this to a bare minimum (say less than 5 cycles),
you can do the same for interrupts. It's just a design tradeoff whether you
want the lowest possible latency for added complexity and (likely) lower
average performance.

Wilco
>
>>> Starting a thread on an event is just as complex as handling an
>>> interrupt.
>
>> No, you start a thread containing a loop
>> and in the beginning of the loop, you wait for an event.
>
> Yes, that is possible if you have enough hardware threads.
> If you do then you don't need traditional interrupts at all anymore
> and only use polling. I can see that would simplify hardware and
> software (I'm sure you agree polling is easier).
>
> However few cores support more than 2 hardware threads as
> having many large register files is a waste of die area and a
> cycle time limiter. I can imagine keeping contexts in memory,
> but that makes switching threads expensive, defeating the
> advantage.
>
And you say that using two cores (which is the current solution)
is less of a waste...

Show me a core which runs, let's say, Bluetooth MAC and GPS MAC
(or similar combination) in a single thread.

In fact, show me a single thread core which can do a full duplex
S/W UART at as high a speed as a two thread core.
>> Once that event occurs, the thread becomes computable
>> and you can read data on the next CPU cycle.
>
> That would be nice, but the reality is that it will take some time
> before you start executing another thread.
No, zero cost context switch cores exist already today.
(And have existed for 20-30 years.)
> You always have the
> event synchronization time and the thread startup time. You
> avoid save/restore like in a traditional interrupt, but you still
> have all the other overheads.
If we assume that we want a thread to react on an edge on an I/O pin,
then there will be a synchronisation delay from the edge to the time
when the event has been raised and changed the status of the thread
from "event wait" to "computable".
During that time, the CPU can execute other threads.

There is no thread startup time when you have a zero cost context
switch architecture - several are around.

This means, in the SPI slave example, that after the clock event is raised,
suddenly all 40 threads become computable. The CPU will switch thread every
clock cycle, so after 40 clock cycles the CPU will have executed:

    data <<= 1

for all the 40 threads.
After 80 clocks:  r0 = MOSI
After 120 clocks: data |= r0
>
> While it is possible to reduce this to a bare minimum (say
> less than 5 cycles),
Less than 5 cycles = 0 cycles in this case.
> you can do the same for interrupts. It's just
> a design tradeoff whether you want the lowest possible latency
> for added complexity and (likely) lower average performance.
>
> Wilco
--
Best Regards,
Ulf Samuelsson

This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Wilco Dijkstra wrote:

> I stand by my claim that it is impossible to make code run with a fixed
> timing on current micro controllers (just to make it 100% clear, I mean
> non-trivial code, and dealing with realtime events).
>
> Microcontrollers typically have different memory timings for the different
> memories, there are data-dependent instruction timings to worry about,
> so you need to write everything in assembler and carefully balance the
> timings of if/then statements. If you pass pointers then you'd need to take
> the memory timing into account wherever the pointers are used.
I think I'll leave it here, but observe that's quite a lot of "qualifiers"
you've now added to the original statement, including one that seems to shift
the definition of microcontroller ;)

You see, not all microcontrollers have such elastic memory timings.
"Non-trivial" is also vague: most designers who go to the effort to get
time-invariant code consider that effort/code non-trivial, but somehow I know
you'll qualify that again....

-jg
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:etuk0p$pj2$1@aioe.org...

>> However few cores support more than 2 hardware threads as
>> having many large register files is a waste of die area and a
>> cycle time limiter. I can imagine keeping contexts in memory,
>> but that makes switching threads expensive, defeating the
>> advantage.
>>
>
> And you say that using two cores (which is the current solution)
> is less of a waste...
I didn't say that, but see below. For a small embedded core, having 2 threads
is maybe 25% extra area, 4 is more likely to be 50%. A faster core replacing
2 smaller cores has around 50% overhead due to the extra complexity to get
faster cycle time. So we have:

1 simple core:              100%
2 simple cores:             200%
1 faster core:              150%
2-way multithreaded core:   188%
4-way multithreaded core:   225%

These are finger in the wind numbers but you can see a heavy multithreaded
core will be larger than several simple cores.
> Show me a core which runs lets say Bluetooth MAC and GPS MAC
> (or similar combination) in a single thread.
Any core that is fast enough will do. Merging two complex pieces of software is obviously non-trivial but it would be equally non-trivial to change them to use multiple threads.
> In fact, show me a single thread core which can do a
> full duplex S/W UART at as high speed as a two thread core.
Again any core will do: using polling you can reach the same speed as a multithreaded core. A good way of doing this is to start with an interrupt, then poll for a while when receiving high-speed data and revert to interrupts again when there are pauses in the data. This way you don't lock up the CPU except when you receive data.
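[Editor's note: as a rough illustration of the interrupt-then-poll scheme
described above, the sketch below shows how a UART receive handler might hand
itself over to polling during a data burst and back to interrupts during a
pause. The helper functions (uart_rx_ready, uart_read_byte, uart_rx_irq_enable,
handle_byte) and the idle limit are assumptions for the sketch, not any
particular part's API.]

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical UART helpers -- placeholders, not a real device's API. */
    extern bool    uart_rx_ready(void);          /* a received byte is waiting */
    extern uint8_t uart_read_byte(void);         /* fetch the received byte    */
    extern void    uart_rx_irq_enable(bool on);  /* gate the RX interrupt      */
    extern void    handle_byte(uint8_t b);       /* application consumer       */

    #define IDLE_LIMIT 1000   /* empty polls before we call it a pause */

    /* RX interrupt handler: take the first byte via the interrupt, then keep
     * polling while data streams in at high speed; once the line goes quiet
     * for a while, re-enable the interrupt and return, so the CPU is only
     * tied up during bursts. */
    void uart_rx_isr(void)
    {
        uint32_t idle = 0;

        uart_rx_irq_enable(false);               /* switch to polling mode */

        while (idle < IDLE_LIMIT) {
            if (uart_rx_ready()) {
                handle_byte(uart_read_byte());
                idle = 0;                        /* still receiving, keep polling */
            } else {
                idle++;                          /* count consecutive empty polls */
            }
        }

        uart_rx_irq_enable(true);                /* back to interrupt mode */
    }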
>>> Once that event occurs, the thread becomes computable
>>> and you can read data on the next CPU cycle.
>>
>> That would be nice, but the reality is that it will take some time
>> before you start executing another thread.
>
> No, zero cost context switch cores exist already today.
> (And has existed for 20-30 years)
Can you mention one? I've seen the Ubicom cores but they switch at the start of the (rather long) pipeline, so it takes many cycles to switch.
>> You always have the
>> event synchronization time and the thread startup time. You
>> avoid save/restore like in a traditional interrupt, but you still
>> have all the other overheads.
>
> If we assume that we want a thread to react on an edge on an I/O pin,
> then there will be a synchronisation delay from the edge to the time
> when the event has been raised and changed the status of the
> thread from "event wait" to "computable".
> During that time, the CPU can execute other threads.
Absolutely.
> There is no thread startup time when you have a zero cost context
> switch architecture. - Several are around.
When you say zero cost context switch, can you tell me how long it would take to execute a "wait_for_event" instruction, the thread going to sleep followed by the event being signaled immediately afterwards followed by resuming execution of the next instruction? On the Ubicom core I believe it takes around 10 cycles, far from zero...
> This means, in the SPI slave example, that after the clock event is raised
> suddenly all 40 threads become computable.
> The CPU will switch thread every clock cycle, so after 40 clock cycles
> the CPU will have executed:
> data <<= 1
> for all the 40 threads.
In the best case yes, but in the worst case I described above it would be more
like 400 cycles.

But let's assume you have a CPU with a zero-cost context switch. Now I assume
a CPU with zero-cost interrupt latency. Is there really any difference?

Wilco
>>> However few cores support more than 2 hardware threads as
>>> having many large register files is a waste of die area and a
>>> cycle time limiter. I can imagine keeping contexts in memory,
>>> but that makes switching threads expensive, defeating the
>>> advantage.
>>>
>>
>> And you say that using two cores (which is the current solution)
>> is less of a waste...
>
> I didn't say that, but see below. For a small embedded core having 2
> threads is maybe 25% extra area, 4 is more likely to be 50%. A faster
> core replacing 2 smaller cores has around 50% overhead due to the
> extra complexity to get faster cycle time. So we have:
>
> 1 simple core: 100%
> 2 simple cores: 200%
> 1 faster core: 150%
> 2-way multithreaded core: 188%
> 4-way multithreaded core: 225%
>
Why not use *REAL* data.

MIPS 34k core with 9 threads = 2.1 mm2 in 90 nm.
MIPS 24k core with 1 thread = 2.8 mm2 in 130 nm.

It is probably fair to assume that 90 nm = 0.5 * 130 nm,
so a MIPS 34k would be about 4.2 mm2 in 130 nm,
or about 50% larger with 9 threads.

The MIPS 34k is actually a dual core (dual VPE), so you have to deduct for
that. I think you will find that it is more like 10% overhead for a simple
core.

It is less overhead for a multithreaded "faster" core than it is for a single
threaded "faster" core, if you accept the limitation that a thread can only
run max 1/2 or 1/3rd of the cycles, because you get rid of feedback muxes.
Less logic in the critical datapath = higher frequency.
> These are finger in the wind numbers but you can see a heavy
> multithreaded core will be larger than several simple cores.
I think the finger is up somewhere... and that wind ain't nice.
>> Show me a core which runs lets say Bluetooth MAC and GPS MAC
>> (or similar combination) in a single thread.
>
> Any core that is fast enough will do. Merging two complex pieces
> of software is obviously non-trivial but it would be equally non-trivial
> to change them to use multiple threads.
No, you run one thread with the Bluetooth MAC and another for the GPS MAC. No or very little change needed...
>
>> In fact, show me a single thread core which can do a
>> full duplex S/W UART at as high speed as a two thread core.
>
> Again any core will do: using polling you can reach the same speed
> as a multithreaded core. A good way of doing this is to start with
> an interrupt, then poll for a while when receiving high-speed data
> and revert to interrupts again when there are pauses in the data.
> This way you don't lock up the CPU except when you receive data.
>
>>>> Once that event occurs, the thread becomes computable
>>>> and you can read data on the next CPU cycle.
>>>
>>> That would be nice, but the reality is that it will take some time
>>> before you start executing another thread.
>>
>> No, zero cost context switch cores exist already today.
>> (And has existed for 20-30 years)
>
> Can you mention one? I've seen the Ubicom cores but they
> switch at the start of the (rather long) pipeline, so it takes many
> cycles to switch.
Are you sure they cannot switch every clock cycle?

MIPS 34k.

In a simple three stage pipeline it is a piece of cake to do what I want.
Main cost is:

PC is changed from a register to an SRAM.
Register bank becomes a register bank array.
Multiple PSRs.

And then you have the scheduling, which can be an advanced timer working on a
register bank. Each thread adds a time quantum every n cycles and deducts
another time quantum every time it gets to use the pipeline, and you try to
execute the threads which have accumulated a lot of time quanta.
Not so hard to implement.
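[Editor's note: a rough C model of the quanta-based thread selection described
above could look like the sketch below; the thread count, the replenish period
and the function names are made up for the illustration and do not describe any
real core.]

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_THREADS       8   /* arbitrary for the sketch                  */
    #define REPLENISH_PERIOD  4   /* every thread earns one quantum per 4 cyc  */

    static int32_t credit[NUM_THREADS];    /* accumulated time quanta           */
    static bool    runnable[NUM_THREADS];  /* set elsewhere when a thread's
                                              event arrives, cleared on wait    */

    /* Pick the runnable thread with the most accumulated quanta.  Called once
     * per clock cycle; the winner pays one quantum for using the pipeline. */
    int pick_next_thread(uint32_t cycle)
    {
        /* periodically hand out credit to every thread */
        if (cycle % REPLENISH_PERIOD == 0)
            for (int t = 0; t < NUM_THREADS; t++)
                credit[t]++;

        int best = -1;
        for (int t = 0; t < NUM_THREADS; t++)
            if (runnable[t] && (best < 0 || credit[t] > credit[best]))
                best = t;

        if (best >= 0)
            credit[best]--;        /* charge the selected thread for the slot */

        return best;               /* -1 means no runnable thread this cycle  */
    }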
>
>>> You always have the
>>> event synchronization time and the thread startup time. You
>>> avoid save/restore like in a traditional interrupt, but you still
>>> have all the other overheads.
>>
>> If we assume that we want a thread to react on an edge on an I/O pin,
>> then there will be a synchronisation delay from the edge to the time
>> when the event has been raised and changed the status of the
>> thread from "event wait" to "computable".
>> During that time, the CPU can execute other threads.
>
> Absolutely.
>
>> There is no thread startup time when you have a zero cost context
>> switch architecture. - Several are around.
>
> When you say zero cost context switch, can you tell me how long it
> would take to execute a "wait_for_event" instruction, the thread going
> to sleep followed by the event being signaled immediately afterwards
> followed by resuming execution of the next instruction? On the Ubicom
> core I believe it takes around 10 cycles, far from zero...
The zero context switch time is between two different threads. If you explicitly yield the thread, then it can take time to stop/start, but in fine grained parallelism, you execute for one clock and then the next clock another thread executes.
>
>> This means, in the SPI slave example, that after the clock event is raised
>> suddenly all 40 threads become computable.
>> The CPU will switch thread every clock cycle, so after 40 clock cycles
>> the CPU will have executed:
>> data <<= 1
>> for all the 40 threads.
>
> In the best case yes, but in the worst case I described above it would be
> more like 400 cycles.
>
> But let's assume you have a CPU with a zero-cost context switch. Now I
> assume a CPU with zero-cost interrupt latency. Is there really any
> difference?
Show me one ;-)

You will not be able to maintain a large number of equally prioritized threads
unless you modify the concept of interrupts to be equal to multithreading.
>
> Wilco
--
Best Regards,
Ulf Samuelsson

This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:etvukp$ioj$1@aioe.org...

> Why not use *REAL* data.
>
> MIPS 34k core with 9 threads = 2,1 mm2 in 90 nm.
> MIPS 24k core with 1 thread = 2,8 mm2 in 130 nm
>
> It is probaly fair to assume that 90 nm = 0,5 * 130 nm
> so a MIPS 34k would be about 4,2 mm2 in 130 nm
> or about 50 % larger with 9 threads.
130->90nm scaling is more like 55-60%, so it is more likely to be 25% larger, not 50%. However consider these are high-end embedded cores with 32KB cache, so the actual core area more than doubles.
> The MIPS 34k is actually a dual core (dual VPE), so you have to deduct for that.
Actually it is a single core. A VPE is simply a virtual CPU to make the OS believe there are 2 cores.
> I think you will find that it is more like 10% overhead for a simple core
Wrong. On a microcontroller with a far simpler pipeline it would be much worse.
A while ago we discussed the size of a register file in embedded CPUs; this is
10-15% of a typical core like ARM7. Imagine 9 copies...
> It is less overhead for a multithreaded "faster" core than it is for
> a single threaded "faster" core, if you accept the limitation
> that a thread can only run max 1/2 or 1/3rd of the cycles
> because you get rid of feedback muxes.
> Less logic in critical datapath = higher frequency.
That is certainly feasible, but you'll have a hard time getting it past marketing types who want to show good benchmarking results... Single threaded performance is still important and will be for a long time.
>>> No, zero cost context switch cores exist already today.
>>> (And has existed for 20-30 years)
>>
>> Can you mention one? I've seen the Ubicom cores but they
>> switch at the start of the (rather long) pipeline, so it takes many
>> cycles to switch.
>
> Are you sure, they cannot switch every clock cycle?
Of course they can switch every clock cycle. But what matters is how fast they
can react to asynchronous events such as branch mispredicts, cache misses,
wait-for-event, etc. If a thread is scheduled to run but it has an unexpected
idle cycle, is it possible to immediately switch to another thread and use that
cycle? Remember a bubble may appear at the end of the pipeline but instruction
fetch is at the beginning, so it can take a while...
> MIPS 34k.
I don't have much information on how threading works on the 34k, but from what
little is available, it appears each thread maintains a separate instruction
queue. This indicates they can switch pretty quickly. I'd be impressed if it
can switch to reclaim idle cycles.
> In a simple three stage pipeline is it a piece of cake to do what I want.
> Main cost is:
>
> PC is changed from a register to an SRAM.
> Register Bank becomes a register bank array.
> Multiple PSRs
>
> and then you have the scheduling which can be
> an advanced timer working on a register bank.
>
> Each thread adds a time quanta every n cycles and deducts another time
> quanta every time it gets to use the pipeline and you try to execute the
> threads which have accumulated a lot of time quanta.
> Not so hard to implement.
The concept is simple indeed, but the details are non-trivial, especially if you want fast thread switching to use idle cycles.
>> When you say zero cost context switch, can you tell me how long it
>> would take to execute a "wait_for_event" instruction, the thread going
>> to sleep followed by the event being signaled immediately afterwards
>> followed by resuming execution of the next instruction? On the Ubicom
>> core I believe it takes around 10 cycles, far from zero...
>
> The zero context switch time is between two different threads.
> If you explicitly yield the thread, then it can take time
> to stop/start, but in fine grained parallelism, you
> execute for one clock and then the next clock another thread
> executes.
Yes. But my point is that if it takes time to start/stop threads then this is equivalent to the interrupt latency. You can't claim that interrupt latency is bad for performance but that thread start/stop latency isn't. It lowers the maximum performance of that thread (in your example of 40 SPI devices it lowers the maximum SPI frequency) and if the CPU can fill the idle cycles with another thread they also reduce overall performance.
>> But let's assume you have a CPU with a zero-cost context switch. Now I
>> assume a CPU with zero-cost interrupt latency. Is there really any
>> difference?
>
> Show me one ;-)
Any multithreaded CPU with a zero-cost context switch will do. You're claiming those exist, right? So zero-cost interrupt latency exists too.
> You will not be able to maintain a large number of equal prioritized threads
> unless you modify the concept of interrupts to be equal to multithreading.
If I run a main thread and have a higher priority interrupt thread servicing
interrupts using 100% of CPU time, do you agree it is identical to an
interrupt-based CPU? So an interrupt driven application can be as fast as a
multithreaded one.

Wilco
"Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> skrev i meddelandet
news:fcUMh.24370$Lz4.2747@newsfe7-gui.ntli.net...
> > "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message > news:etvukp$ioj$1@aioe.org... > >> Why not use *REAL* data. >> >> MIPS 34k core with 9 threads = 2,1 mm2 in 90 nm. >> MIPS 24k core with 1 thread = 2,8 mm2 in 130 nm >> >> It is probaly fair to assume that 90 nm = 0,5 * 130 nm >> so a MIPS 34k would be about 4,2 mm2 in 130 nm >> or about 50 % larger with 9 threads. > > 130->90nm scaling is more like 55-60%, so it is more likely to be > 25% larger, not 50%. However consider these are high-end embedded > cores with 32KB cache, so the actual core area more than doubles.
From MIPS homepage: " 2.1 mm2 (core only, extracted from full layout GDSII database)"
>
>> The MIPS 34k is actually a dual core (dual VPE), so you have to deduct
>> for that.
>
> Actually it is a single core. A VPE is simply a virtual CPU to make the
> OS believe there are 2 cores.
>
>> I think you will find that it is more like 10% overhead for a simple core
>
> Wrong. On a micro controller with a far simpler pipeline it would be much
> worse. A while ago we discussed the size of a register file in embedded
> CPUs, this is 10-15% of a typical core like ARM7. Imagine 9 copies.
I meant per thread. You do not need much more than the register file and
prefetch buffer, so 10-15% extra per thread does not seem unreasonable.

A dual thread 40 MHz CPU can replace two 20 MHz CPUs.
A single thread 40 MHz CPU cannot always replace two 20 MHz CPUs.
Let's take an obvious case, where one is running the OSE operating system and
the other is running Thread/X. How are you going to do that on a single
thread?

The combined GPS and Bluetooth stack is a better example. A GPS company would
normally not allow anyone to mess with the code running on the ARM; the impact
on support and maintenance is too high. Running a thread with the GPS is much
more attractive and would allow the user to run their own threads without
affecting the GPS timing enough to be a problem.
>
>> It is less overhead for a multithreaded "faster" core than it is for
>> a single threaded "faster" core, if you accept the limitation
>> that a thread can only run max 1/2 or 1/3rd of the cycles
>> because you get rid of feedback muxes.
>> Less logic in critical datapath = higher frequency.
>
> That is certainly feasible, but you'll have a hard time getting it past
> marketing types who want to show good benchmarking results...
> Single threaded performance is still important and will be for a long
> time.
Not for a 20 MIPS application, it ain't. No one is interested in how many MIPS
the CPU core in a GPS chip has.
>
>>>> No, zero cost context switch cores exist already today.
>>>> (And has existed for 20-30 years)
>>>
>>> Can you mention one? I've seen the Ubicom cores but they
>>> switch at the start of the (rather long) pipeline, so it takes many
>>> cycles to switch.
>>
>> Are you sure, they cannot switch every clock cycle?
>
> Of course they can switch every clock cycle. But what matters is how
> fast they can react to asynchronous events such as branch mispredicts,
> cachemisses, wait for event etc. If a thread is scheduled to run but it
> has an unexpected idle cycle, is it possible to immediately switch to
> another thread and use that cycle?
Yes, when you have a jump you would immediately make this task non-computable,
and have another computable thread enter the pipeline. If it becomes computable
the next clock, you can switch it in again.
> Remember a bubble may appear
> at the end of the pipeline but instruction fetch is at the beginning, so
> it can take a while...
>
>> MIPS 34k.
>
> I don't have much information how threading works on the 34k, but
> from what little is available, it appears each thread maintains a
> separate instruction queue. This indicates they can switch pretty
> quickly. I'd be impressed if it can switch to reclaim idle cycles.
Why not? The AVR32 removes jumps from the pipeline so the execution unit will
only see arithmetic instructions.
>
>> In a simple three stage pipeline is it a piece of cake to do what I want.
>> Main cost is:
>>
>> PC is changed from a register to an SRAM.
>> Register Bank becomes a register bank array.
>> Multiple PSRs
>>
>> and then you have the scheduling which can be
>> an advanced timer working on a register bank.
>>
>> Each thread adds a time quanta every n cycles and deducts another time
>> quanta every time it gets to use the pipeline and you try to execute the
>> threads which have accumulated a lot of time quanta.
>> Not so hard to implement.
>
> The concept is simple indeed, but the details are non-trivial,
> especially if you want fast thread switching to use idle cycles.
I expect that in normal operation you will switch thread EVERY clock cycle.
It becomes more complex if you want dynamic allocation of threads.

A real simple solution would be to have a circular buffer of programmable
size. Each entry in the buffer is a thread number.
So if you had a 10 entry circular buffer you could have:

1,2,1,3,1,2,1,4,1,5

At 100 MHz, this would give you:

Thread 1: 5 entries = 50 MHz
Thread 2: 2 entries = 20 MHz
Thread 3, 4, 5: 1 entry each = 10 MHz

If a thread is not computable, then you can give the cycle to one of the other
threads, or to a debug thread, or to a background thread, or whatever.
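[Editor's note: a minimal C model of the slot-table scheme just described,
using the 10-entry table from the example; the background-thread fallback and
the names are illustrative assumptions only.]

    #include <stdbool.h>

    #define SLOTS             10
    #define BACKGROUND_THREAD 0   /* hypothetical thread that soaks up spare cycles */

    /* 10-entry slot table from the example: thread 1 gets 5 slots (50 MHz at a
     * 100 MHz clock), thread 2 gets 2 slots (20 MHz), threads 3-5 one slot each. */
    static const int slot_table[SLOTS] = { 1, 2, 1, 3, 1, 2, 1, 4, 1, 5 };

    extern bool thread_computable(int thread);   /* event/status check, assumed */

    /* Called once per clock cycle: walk the circular slot table; if the slot's
     * owner is waiting on an event, give the cycle to the background thread. */
    int next_thread(void)
    {
        static int slot = 0;

        int t = slot_table[slot];
        slot = (slot + 1) % SLOTS;

        return thread_computable(t) ? t : BACKGROUND_THREAD;
    }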
>>> When you say zero cost context switch, can you tell me how long it
>>> would take to execute a "wait_for_event" instruction, the thread going
>>> to sleep followed by the event being signaled immediately afterwards
>>> followed by resuming execution of the next instruction? On the Ubicom
>>> core I believe it takes around 10 cycles, far from zero...
>>
>> The zero context switch time is between two different threads.
>> If you explicitly yield the thread, then it can take time
>> to stop/start, but in fine grained parallelism, you
>> execute for one clock and then the next clock another thread
>> executes.
>
> Yes. But my point is that if it takes time to start/stop threads then this
> is equivalent to the interrupt latency. You can't claim that interrupt
> latency is bad for performance but that thread start/stop latency isn't.
> It lowers the maximum performance of that thread (in your example of 40 SPI
> devices it lowers the maximum SPI frequency) and if the CPU can fill
> the idle cycles with another thread they also reduce overall performance.
>
No, but I say that the latencies do not reduce the total throughput of the CPU.
Even with latencies, you can get a higher utilization of the pipeline as long
as there is at least one computable thread. No bubbles in the pipeline, no
branch prediction needed. Branch prediction will improve the performance of a
single thread, but it will not allow the CPU to execute more instructions.

I believe that a thread that replaces an interrupt is started already at
initialization and put in an event wait state. Since there is no context to
save/restore, the thread can react much faster than an interrupt driven device.
>>> But let's assume you have a CPU with a zero-cost context switch. Now I
>>> assume a CPU with zero-cost interrupt latency. Is there really any
>>> difference?
>>
>> Show me one ;-)
>
> Any multithreaded CPU with a zero-cost context switch will do. You're
> claiming those exist, right? So zero-cost interrupt latency exists too.
I am not claiming that a multithreaded CPU has zero interrupt latency.
I am claiming that once it has been decided to switch thread, you can do it
without any overhead. It is still going to take time after an event has
occurred before the decision has been made.

You were trying to prove that a single thread core is as good as a
multithreaded core, and now you are claiming that a multithreaded core is as
good as a multithreaded core, duh!

Again, show me a real CPU with zero cost interrupt latency.
>> You will not be able to maintain a large number of equal prioritized
>> threads unless you modify the concept of interrupts to be equal to
>> multithreading.
>
> If I run a main thread and have a higher priority interrupt thread
> servicing interrupts using 100% of CPU time, do you agree it is identical
> to an interrupt-based CPU? So an interrupt driven application can be as
> fast as a multithreaded one.
If you go back to the case where you are servicing 40 slave SPIs, you will NOT
get the same throughput in a single thread machine, simply because you have
overhead in servicing the interrupt, and because you will not interrupt another
task which has the same priority level.
>
> Wilco
Do you EVER give up a lost cause?

--
Best Regards,
Ulf Samuelsson

This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Ulf Samuelsson wrote:
>>>> Starting a thread on an event is just as complex as handling an
>>>> interrupt.
>>
>>> No, you start a thread containing a loop
>>> and in the beginning of the loop, you wait for an event.
>>
>> Yes, that is possible if you have enough hardware threads.
>> If you do then you don't need traditional interrupts at all anymore
>> and only use polling. I can see that would simplify hardware and
>> software (I'm sure you agree polling is easier).
>>
>> However few cores support more than 2 hardware threads as
>> having many large register files is a waste of die area and a
>> cycle time limiter. I can imagine keeping contexts in memory,
>> but that makes switching threads expensive, defeating the
>> advantage.
>>
>
> And you say that using two cores (which is the current solution)
> is less of a waste...
> Show me a core which runs lets say Bluetooth MAC and GPS MAC
> (or similar combination) in a single thread.
>
> In fact, show me a single thread core which can do a
> full duplex S/W UART at as high speed as a two thread core.
On the subject of multiple cores and multiple threads, news today shows this is
advancing quite quickly. Intel does not seem to think it is a 'waste of die
area'..... Eight cores and 16 threads (probably they mean per-core?) is
impressive for what sound like fairly mainstream cores.

http://www.eetimes.com/news/semi/showArticle.jhtml?articleID=198700787

"Intel's 45-nm high-k process technology offers approximately twice the
transistor budget, 20 percent faster transistor switching speed and lower
leakage current when compared with the company's 65-nm technology, Gelsinger
said. Nehalem's scalable architecture provides for between one and 16 or more
threads utilizing one to eight or more cores, Gelsinger said. He added that
Nehalem processors already in design have eight cores and 16 threads. Some
Nehalem processors are likely to have more cores, he said, declining to discuss
specific product configurations. Nehalem's architecture provides for
simultaneous multi-threading, multi-level shared cache, Gelsinger said."

-jg
Data sheets and info on Eval PCB, etc, are now up at

http://www.atmel.com/dyn/general/updates.asp?cboDocType=0&cboFamily=0&btnSubmit=Submit

-jg


