Reply by Ulf Samuelsson, April 11, 2007
>> The latency for
>>
>>              THREADING   INTERRUPT
>> ___________________________________
>> thread B         2           1      clocks
>> thread A         1           8      clocks
>>
>> If you are interested in worst case performance, then the interrupt
>> structure is 4 times slower in reacting to the event.
>
> Correct, execution of the first instruction is slower indeed. However
> this has nothing to do with the maximum frequency of the SPIs...
>
> In both cases the fastest we can receive bits from the SPIs is 2 bits
> every 16 cycles. Irrespective of how many SPIs you emulate, the maximum
> SPI frequency depends on the total time taken by the interrupt routine.
> So clearly the latency of the first instruction does not matter at all.
>
You don't see how it scales. Assume a 50/50 duty cycle on the SPI clock. The SPI slaves read on the positive edge and the SPI masters alter data on the negative edge; data is to be considered INVALID while the SPI clock is low.

The interrupts occur on the positive edge, and the master must keep data valid and the clock high until the last interrupt has sampled its I/O. If you have 16 SPIs, then the master cannot release the data until the last of the 16 interrupt routines has sampled its I/O. So it can release after (15 * 8) + 1 = 121 clocks, forcing the total SPI cycle to be 2 * 121 = 242 cycles. If we assume that interrupt entry takes a number of clock cycles (12 in the case of the Cortex), you add 16 * 12 = 192 clocks to the 121, giving 313 clocks for half a period and a total cycle of 626 cycles.

In the multithreading case, the master can release after 16 cycles, but the rate is also limited by the execution time of each thread, so you get an SPI cycle of 16 * 8 = 128 cycles.

It is really very simple if you open your eyes.
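[To make the arithmetic easy to check, a minimal C sketch assuming only the figures stated above: 16 slaves, 8-instruction handlers, and the quoted 12-cycle Cortex interrupt entry.]

#include <stdio.h>

int main(void)
{
    const int n_spi     = 16;   /* emulated SPI slaves                 */
    const int handler   = 8;    /* instructions (cycles) per handler   */
    const int irq_entry = 12;   /* interrupt entry cost, Cortex figure */

    /* Interrupt case: the clock must stay high until the last of the
       16 serialised handlers has sampled its pin.                     */
    int half       = (n_spi - 1) * handler + 1;       /* = 121 */
    int half_entry = half + n_spi * irq_entry;        /* = 313 */

    /* Threaded case: 16 threads get one slot per 16-cycle round, and
       an 8-instruction handler needs 8 rounds.                        */
    int threaded = n_spi * handler;                   /* = 128 */

    printf("interrupt cycle, no entry cost: %d\n", 2 * half);       /* 242 */
    printf("interrupt cycle, with entry:    %d\n", 2 * half_entry); /* 626 */
    printf("threaded cycle:                 %d\n", threaded);       /* 128 */
    return 0;
}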
>> If you have more interrupts, then it can take forever and ever
>> for the last interrupt to handle its input pin.
>
> Only if higher priority interrupts occur. The interrupt structure is
> designed so that each interrupt can meet its worst-case deadline. This is
> similar to allocating time quanta, but rather than guaranteeing a timeslot
> that is fast enough to handle the worst case, you guarantee the worst case
> by setting interrupt priorities. A different methodology, but the end
> result is the same.
>
No it isn't. Worst case latency for multithreading is in this case 16 clock cycles, against 121 clock cycles for the interrupt case. Interrupt priorities do not help you when the interrupts all have the same priority.

Also remember that one key benefit of multithreading is that it allows two groups to develop S/W independently of each other and lets a third party use that S/W as a library. If you run a real-time operating system where the two threads have to share, then it becomes a mess, which usually results in having two CPUs instead of two threads.
>> With the right allocation structure you can, in a multithreaded CPU,
>> guarantee that you are allocated a certain number of instructions
>> per time quantum. This is what you need to support worst case performance.
>
> Sure (with time quanta things become more predictable but the latency
> goes up too - you can't have both!). However I'm still at a loss as to why
> you claim that multithreading would allow for higher frequency SPIs...
>
Just do the numbers...
> Wilco
--
Best Regards,
Ulf Samuelsson

This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply by Wilco Dijkstra, April 10, 2007
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:ev7uof$md1$1@aioe.org...
> "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> skrev i meddelandet > news:ysyRh.2432$gr2.319@newsfe4-gui.ntli.net... >> >> "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:ev5v86$nt5$1@aioe.org... >> >> I think this is the crux of the problem, so let's address this first: >> >>> Tell me how your interrupt system will make the pipeline execute >>> instructions for two interrupts A and B occuring in the same time. >> >> Neither case can execute the instructions of both interrupts >> exactly at the same time (only a multicore would execute A1 >> and B1 at the same time, not serially like below). >> >>> A1:B1:A2:B2:A3:B3:A4:B4:A5:B5:A6:B6:A7:B7 >>> >>> Instead of >>> >>> B1:B2:B3:B4:B5:B6:B7:A1:A2:A3:A4:A5:A6:A7 >>> >>> Which I believe is the normal way for interrupts to behave... >>> >>> You may want to note the time until both threads/interrupt >> >> The code executes in the order as you wrote above (assuming >> one instruction per cycle). In both cases interrupt handling starts >> and stops exactly at the same time, so there is no difference in >> total interrupt latency. In both cases instructions are executed >> serially, but with different interleaving. However any interleaving >> (like A1:A2:A3:B1:B2:B3:B4:B5:B6:A4:A5:A6:A7:B7) is correct >> as the interrupts are independent. >> >> Now where do you see a problem? If you do, please remember >> that just about all CPUs today execute interrupts serially without >> any issues, and that multithreaded CPUs do interleave instructions >> differently depending on circumstances (eg. other interrupts). > > > If instruction A1 and B1 both read the SPI slave data from an I/O port, > the SPI masters can release the data already when B1 has completed in case 1 > which is after 2 clock cycles.
That's a big if... You'd usually need some more instructions to signal you've read the bit, so it is not necessarily the first instruction that is critical. Anyway, it doesn't matter, see below.
> If you adopt an interrupt structure, then the SPI masters can only release > the data after 8 clocks in the second case.
Correct. This will delay the next interrupt for A so that next time round interrupts for A and B are not received at the same time.
> Your interrupt structure is in this case 4 times slower...
More accurately the first instruction has a 3 times higher latency, while the last instruction has 75% of the latency. And when averaged over all instructions the latency of both cases is the same...
> The latency for
>
>              THREADING   INTERRUPT
> ___________________________________
> thread B         2           1      clocks
> thread A         1           8      clocks
>
> If you are interested in worst case performance, then the interrupt
> structure is 4 times slower in reacting to the event.
Correct, execution of the first instruction is slower indeed. However this has nothing to do with the maximum frequency of the SPIs...

In both cases the fastest we can receive bits from the SPIs is 2 bits every 16 cycles. Irrespective of how many SPIs you emulate, the maximum SPI frequency depends on the total time taken by the interrupt routine. So clearly the latency of the first instruction does not matter at all.
> If you have more interrupts, then it can take forever and ever
> for the last interrupt to handle its input pin.
Only if higher priority interrupts occur. The interrupt structure is designed so that each interrupt can meet its worst-case deadline. This is similar to allocating time quanta, but rather than guaranteeing a timeslot that is fast enough to handle the worst case, you guarantee the worst case by setting interrupt priorities. A different methodology, but the end result is the same.
> With the right allocation structure you can, in a multithreaded CPU,
> guarantee that you are allocated a certain number of instructions
> per time quantum. This is what you need to support worst case performance.
Sure (with time quanta things become more predictable but the latency goes up too - you can't have both!). However I'm still at a loss as to why you claim that multithreading would allow for higher frequency SPIs...

Wilco
Reply by Ulf Samuelsson, April 7, 2007
"Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> skrev i meddelandet 
news:ysyRh.2432$gr2.319@newsfe4-gui.ntli.net...
> > "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message > news:ev5v86$nt5$1@aioe.org... > > I think this is the crux of the problem, so let's address this first: > >> Tell me how your interrupt system will make the pipeline execute >> instructions for two interrupts A and B occuring in the same time. > > Neither case can execute the instructions of both interrupts > exactly at the same time (only a multicore would execute A1 > and B1 at the same time, not serially like below). > >> A1:B1:A2:B2:A3:B3:A4:B4:A5:B5:A6:B6:A7:B7 >> >> Instead of >> >> B1:B2:B3:B4:B5:B6:B7:A1:A2:A3:A4:A5:A6:A7 >> >> Which I believe is the normal way for interrupts to behave... >> >> You may want to note the time until both threads/interrupt > > The code executes in the order as you wrote above (assuming > one instruction per cycle). In both cases interrupt handling starts > and stops exactly at the same time, so there is no difference in > total interrupt latency. In both cases instructions are executed > serially, but with different interleaving. However any interleaving > (like A1:A2:A3:B1:B2:B3:B4:B5:B6:A4:A5:A6:A7:B7) is correct > as the interrupts are independent. > > Now where do you see a problem? If you do, please remember > that just about all CPUs today execute interrupts serially without > any issues, and that multithreaded CPUs do interleave instructions > differently depending on circumstances (eg. other interrupts).
If instructions A1 and B1 both read the SPI slave data from an I/O port, the SPI masters can release the data already when B1 has completed in case 1, which is after 2 clock cycles.

If you adopt an interrupt structure, then the SPI masters can only release the data after 8 clocks in the second case. Your interrupt structure is in this case 4 times slower...

The latency for

             THREADING   INTERRUPT
___________________________________
thread B         2           1      clocks
thread A         1           8      clocks

If you are interested in worst case performance, then the interrupt structure is 4 times slower in reacting to the event. If you have more interrupts, then it can take forever and ever for the last interrupt to handle its input pin.

With the right allocation structure you can, in a multithreaded CPU, guarantee that you are allocated a certain number of instructions per time quantum. This is what you need to support worst case performance.
> Wilco >
Reply by Wilco Dijkstra, April 6, 2007
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:ev5v86$nt5$1@aioe.org...

I think this is the crux of the problem, so let's address this first:

> Tell me how your interrupt system will make the pipeline execute
> instructions for two interrupts A and B occurring at the same time.
Neither case can execute the instructions of both interrupts exactly at the same time (only a multicore would execute A1 and B1 at the same time, not serially like below).
> A1:B1:A2:B2:A3:B3:A4:B4:A5:B5:A6:B6:A7:B7
>
> Instead of
>
> B1:B2:B3:B4:B5:B6:B7:A1:A2:A3:A4:A5:A6:A7
>
> Which I believe is the normal way for interrupts to behave...
>
> You may want to note the time until both threads/interrupt
The code executes in the order as you wrote above (assuming one instruction per cycle). In both cases interrupt handling starts and stops exactly at the same time, so there is no difference in total interrupt latency. In both cases instructions are executed serially, but with different interleaving. However any interleaving (like A1:A2:A3:B1:B2:B3:B4:B5:B6:A4:A5:A6:A7:B7) is correct as the interrupts are independent.

Now where do you see a problem? If you do, please remember that just about all CPUs today execute interrupts serially without any issues, and that multithreaded CPUs do interleave instructions differently depending on circumstances (eg. other interrupts).

Wilco
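[A toy single-issue model makes both halves of this dispute concrete; a minimal sketch assuming one instruction retires per cycle and 7 instructions per handler. Both schedules finish at cycle 14, but the first instruction of the postponed handler retires at cycle 8 instead of cycle 1 or 2.]

#include <stdio.h>

/* Prints the retire cycle of each handler's first and last
   instruction for a given schedule string of 'A's and 'B's. */
static void report(const char *name, const char *sched)
{
    int firstA = 0, lastA = 0, firstB = 0, lastB = 0;
    for (int cyc = 1; sched[cyc - 1]; cyc++) {
        if (sched[cyc - 1] == 'A') { if (!firstA) firstA = cyc; lastA = cyc; }
        else                       { if (!firstB) firstB = cyc; lastB = cyc; }
    }
    printf("%s: A first=%2d last=%2d, B first=%2d last=%2d\n",
           name, firstA, lastA, firstB, lastB);
}

int main(void)
{
    report("interleaved", "ABABABABABABAB");   /* A1:B1:A2:B2:...  */
    report("serial     ", "BBBBBBBAAAAAAA");   /* B1..B7 then A1..A7 */
    return 0;
}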
Reply by Ulf Samuelsson, April 6, 2007
>>>> Multithreading on a high end general purpose CPU gives problems of its own.
>>>> Especially with cache thrashing.
>>>
>>> Absolutely. The "solution" is to add more cache...
>>
>> No, the solution is to have more associativity in the cache.
>> Having 4GB of direct mapped cache will not help you when
>> two threads start using the same cache line.
>
> No. If you switch between threads in a fine-grained way you need to ensure
> that the working set of each thread stays in the cache. This means the cache
> needs to be large enough to hold the code and data from several threads.
> The problem is that L1 caches are often too small even for a single thread...
Again you do not read, or you may not be aware of the difference between a direct mapped cache and a set-associative cache. And your memory is failing as well, as I am proposing tightly coupled memory without any cache for all threads except the "application" thread.
> Associativity is not an issue at all, most caches are already 4 or 8-way
> set associative. If it were feasible, a 4GB direct mapped cache would not
> thrash at all as no threads would ever use the same line.
>
Direct mapped means that for each memory location there is exactly one location in the cache which can hold that word. Since your cache is not the same size as the primary memory, each cache location serves a large number of memory locations, all of which will only fit into that one cache location. If the threads happen to access memory locations which all map to the same cache location, you get terrible cache thrashing.

Read a book on caches...
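[A minimal sketch of the point, assuming a 32 KB direct-mapped cache with 32-byte lines: the line index is a pure function of the address, so two addresses whose index bits match must evict each other.]

#include <stdint.h>

#define LINE_SIZE 32u    /* bytes per line (assumed)    */
#define NUM_LINES 1024u  /* a 32 KB direct-mapped cache */

/* In a direct-mapped cache there is no choice of victim: the line
   index is fixed by the address, unlike a set-associative cache.  */
uint32_t cache_index(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_LINES;
}

/* Two thread stacks placed a multiple of the cache size (32 KB)
   apart share every line index and thrash on alternating accesses:
   cache_index(0x20000000) == cache_index(0x20008000) == 0          */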
>>>> With an embedded core where you use tightly coupled high bandwidth memory
>>>> for most of the threads you do not have that problem
>>>
>>> Same solution: more fast on-chip memory.
>>
>> If you want to solve the problem for general-purpose symmetric
>> multiprocessing by putting the application memory on the chip, you are
>> going to run into significant problems.
>> You are beginning to get out of touch with reality, my dear friend.
>
> The current trend is clear: more on-chip memory either as caches or
> tightly coupled memory. And FYI there are no problems with symmetric
> multiprocessing, people have been doing it for many years. Cache
> coherency is a well understood problem, even high-end ARMs have it.
And way too expensive, if you can solve it with a multithreaded core connected to TCM.
>> In order for interrupts to be equivalent to multithreading,
>> where you can select a new instruction to execute from a different
>> interrupt every clock cycle, you have to add additional constraints
>> to your "interrupt" system.
>>
>> You have to have multiple register files and multiple program counters
>> in the system.
>> You have to add additional hardware to dynamically raise/lower priorities
>> in order to distribute instructions among the different interrupts.
>> Your "interrupt" driven system is likely to be mistaken for a
>> multithreading system.
>
> Is it really that difficult to understand? Let me explain it in a
> different way.
>
> Start with the MIPS 34k core, and assign 1 thread to the main task and
> the others to one interrupt each. Set the thread priority of the interrupt
> threads to infinite. At this point the CPU behaves exactly like an interrupt
> driven core that uses special registers on an interrupt (many do so,
> including ARM). If you can only ever run one thread, you can't mistake this
> for a multithreaded core.
>
> From the other perspective, in an interrupt-driven core you typically
> associate a function with each interrupt. There is *nothing* that prevents
> a CPU from prefetching the first few instructions of some or all interrupt
> routines. In combination with the use of special registers to avoid
> save/restore overhead, this can significantly reduce interrupt latency.
>
> Now tell me what the difference is between the above 2 cases.
> Do you still believe interrupts and threads are not closely related?
>
Tell me how your interrupt system will make the pipeline execute instructions for two interrupts A and B occurring at the same time:

A1:B1:A2:B2:A3:B3:A4:B4:A5:B5:A6:B6:A7:B7

Instead of

B1:B2:B3:B4:B5:B6:B7:A1:A2:A3:A4:A5:A6:A7

Which I believe is the normal way for interrupts to behave... You may want to note the time until both threads/interrupt
>> Your way of discussion is way off, you ignore ALL arguments
>> and requests to prove your point, in favour of continued rambling...
>>
>> You need to show that the given example (multiple SPI slaves)
>> can be handled equally well by an *existing* interrupt driven system
>> as well as how it can be handled by an *existing* multithreaded
>> system like the zero context switch cost MIPS processor,
>
> Done that, please reread my old posts. I have also shown that any
> zero-cost context switch multithreaded CPU (if it exists) can behave
> like a zero-cost interrupt based CPU.
>
No, you have not shown that an interrupt based CPU can interleave instructions in the way a multithreaded core can do it. Your "zero" interrupt latency core does not and will not exist.
> However you haven't shown a 40-thread CPU capable of running your > example. Without one thread for each interrupt you need to use traditional > interrupt handling rather than polling for events. Most embedded systems > need more than the 8 interrupts/threads MIPS could handle, especially > when combining 2 or more existing cores into 1 as you suggest.
Again you refrain from answering. I have shown the MIPS threaded core, and running 40 threads on such a core is a simple extension of the basic concept. If it makes you happier, then try to do it with the 8 threads you can fit into the MIPS core.
>>>> If you continue, that just proves that you are either ignorant or not
>>>> listening
>>>
>>> That kind of response is not helping your case. If you believe I'm wrong,
>>> then why not prove me wrong with some hard facts and data?
>>>
>> I already did.
>> I showed that there exists a zero context switch cost MIPS processor.
>
> No you didn't. The MIPS core can switch between threads on every
> cycle, but that doesn't imply zero cost context switch on an interrupt.
>
I have never tried to prove that there are zero-cost interrupts. That is your idea, which will never fly.
>> You have not shown that there exist zero cost interrupts.
>
> There is no such thing as a zero-cost interrupt. There are a few CPUs that
> can respond extremely quickly (eg. Transputer, Forth chips). However there
> is a tradeoff between the need for fast execution of normal code and fast
> interrupt response time.
>
>> If we go back to the example.
>>
>> You have a fixed clock.
>> This is used by a number of SPI masters to provide data to your chip.
>> Your chip implements SPI slaves and each SPI slave should run
>> in a separate task/thread or whatever.
>> The communication on each SPI slave channel is totally different
>> and should be developed by two teams which do not communicate
>> with each other and are not aware of each other.
>> Once per byte, the SPI data is written to memory and
>> an event flag register private to the thread/interrupt is written.
>>
>> They are aware of the execution environment, which in the interrupt case
>> is the RTOS and how interrupts are handled.
>>
>> Using one multithreaded and one interrupt processor, with frequency
>> scaled so the top level of MIPS is equivalent, show that you can
>> implement the SPI slave.
>
> I've already described 2 ways of doing it, reread my old posts.
> If you think it is not possible, please explain why exactly you think that,
> then I'll explain the fallacy in your argument.
Done earlier in this post. You cannot interleave instructions at a predetermined rate.
>>> The T1 has tiny caches and stalls on a cache miss unlike any other
>>> high-end out-of-order CPU, so they require more threads to keep going
>>> if one thread stalls. It is also designed for highly multithreaded workloads,
>>> so having more thread contexts means fewer context switches in software,
>>> which can be a big win on workloads running on UNIX/Windows (realtime
>>> OSes are far better at these things).
>>
>> It is the other way around. *Because* you have many threads you CAN
>> stall a thread on a cache miss, without affecting the total throughput
>> of the CPU.
>
> For the same amount of hardware, more threads means less space for
> caches, so more cache misses. More cache misses means you need
> more threads. Typical chicken and egg situation...
>
If you don't have a cache, you don't get any cache misses.
>> It is very likely that the T1 shoves more instructions
>> per clock cycle than a "high end, branch prediction, out of order" single
>> or dual thread CPU.
>
> Actually T1 benchmarks are very disappointing: with twice the number
> of cores and 8 times the number of threads the T1 does not even get
> close to Opteron or Woodcrest on heavily multithreaded benchmarks...
>
> It doesn't mean the whole idea is bad, I think the next generation will do
> much better (and so will AMD/Intel). However claiming that an in-order
> multithreaded CPU will easily outperform an out-of-order CPU on total
> work done is total rubbish.
>
>>>> I do not think that they are limited by Intel's vision...
>>>> Also I pointed you at the new MIPS Multithreading core.
>>>> They certainly do not agree with you!
>>>
>>> If you do not understand the differences between cores like Itanium-2,
>>> Pentium-4, Nehalem, Power5, Power6 (all 2-way multithreaded),
>>> and cores like the T1, MIPS34K and Ubicom (8+ -way threaded),
>>> then you're not the expert on multithreading you claim to be.
>>>
>> You seem to want to slip into a discussion of which type
>> of CPU will exhibit the highest MIPS rate for a single thread.
>> That is trying to force open an already open door.
>
> No, I wasn't talking about fast single thread performance. My point is that
> it is a fallacy to think that adding more and more threads is always better.
> Like so many other things, returns diminish while costs increase. I claim
> it would be a waste to add more threads on an out-of-order core (max
> frequency would go down, more cache needed to reclaim performance
> loss, so not cost effective).
>
If you can replace a full core with a thread you always win. Obviously you are not going to take the time to go through the SPI slave example which proves you wrong. I suspect the reason is that you know you are wrong but are too stiff-headed to admit it, so I consider any future discussion on this subject with you a total waste of time.
> Wilco >
--
Best Regards,
Ulf Samuelsson

This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply by Wilco Dijkstra, April 6, 2007
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:ev4t4i$d4c$1@aioe.org...
> "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> skrev i meddelandet > news:e8eRh.2250$gr2.1244@newsfe4-gui.ntli.net... >> >> "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:ev2i1h$qhh$1@aioe.org... >> >>> Multithreading on a high end general purpose CPU gives problem on their own. >>> Especially with cache trashing. >> >> Absolutely. The "solution" is to add more cache... > > No, the solution is to have more associativity in the cache. > Having 4GB of direct mapped cache will not help you when > two threads start using the same cache line.
No. If you switch between threads in a fine-grained way you need to ensure that the working set of each thread stays in the cache. This means the cache needs to be large enough to hold the code and data from several threads. The problem is that L1 caches are often too small even for a single thread...

Associativity is not an issue at all, most caches are already 4 or 8-way set associative. If it were feasible, a 4GB direct mapped cache would not thrash at all as no threads would ever use the same line.
>>> With an embedded core where you use tightly coupled high bandwidth memory
>>> for most of the threads you do not have that problem
>>
>> Same solution: more fast on-chip memory.
>
> If you want to solve the problem for general-purpose symmetric
> multiprocessing by putting the application memory on the chip, you are
> going to run into significant problems.
> You are beginning to get out of touch with reality, my dear friend.
The current trend is clear: more on-chip memory either as caches or tightly coupled memory. And FYI there are no problems with symmetric multiprocessing, people have been doing it for many years. Cache coherency is a well understood problem, even high-end ARMs have it.
> In order for interrupts to be equivalent to multithreading,
> where you can select a new instruction to execute from a different
> interrupt every clock cycle, you have to add additional constraints
> to your "interrupt" system.
>
> You have to have multiple register files and multiple program counters
> in the system.
> You have to add additional hardware to dynamically raise/lower priorities
> in order to distribute instructions among the different interrupts.
> Your "interrupt" driven system is likely to be mistaken for a
> multithreading system.
Is it really that difficult to understand? Let me explain it in a different way.

Start with the MIPS 34k core, and assign 1 thread to the main task and the others to one interrupt each. Set the thread priority of the interrupt threads to infinite. At this point the CPU behaves exactly like an interrupt driven core that uses special registers on an interrupt (many do so, including ARM). If you can only ever run one thread, you can't mistake this for a multithreaded core.

From the other perspective, in an interrupt-driven core you typically associate a function with each interrupt. There is *nothing* that prevents a CPU from prefetching the first few instructions of some or all interrupt routines. In combination with the use of special registers to avoid save/restore overhead, this can significantly reduce interrupt latency.

Now tell me what the difference is between the above 2 cases. Do you still believe interrupts and threads are not closely related?
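[A minimal sketch of the claimed equivalence, with an invented event register; this is not actual MIPS 34K code. The dedicated-thread view and the ISR view share the same body and differ only in whether the context lives permanently in hardware or is saved/restored on entry.]

#include <stdint.h>

extern volatile uint32_t EVENT_FLAG;   /* hypothetical event register */
extern void handle_event(void);        /* the actual work, shared     */

/* View 1: dedicated hardware thread. Its registers and PC are
   private, so there is nothing to save or restore.              */
void event_thread(void)
{
    for (;;) {
        while (EVENT_FLAG == 0)
            ;                          /* hardware parks the thread   */
        EVENT_FLAG = 0;
        handle_event();
    }
}

/* View 2: conventional ISR. The core saves/restores context (or
   switches to a banked register set) around the same body.      */
void event_isr(void)
{
    EVENT_FLAG = 0;
    handle_event();
}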
> Your way of discussion is way off, you ignore ALL arguments
> and requests to prove your point, in favour of continued rambling...
>
> You need to show that the given example (multiple SPI slaves)
> can be handled equally well by an *existing* interrupt driven system
> as well as how it can be handled by an *existing* multithreaded
> system like the zero context switch cost MIPS processor,
Done that, please reread my old posts. I have also shown that any zero-cost context switch multithreaded CPU (if it exists) can behave like a zero-cost interrupt based CPU.

However you haven't shown a 40-thread CPU capable of running your example. Without one thread for each interrupt you need to use traditional interrupt handling rather than polling for events. Most embedded systems need more than the 8 interrupts/threads MIPS could handle, especially when combining 2 or more existing cores into 1 as you suggest.
>>> If you continue, that just proves that you are either ignorant or not
>>> listening
>>
>> That kind of response is not helping your case. If you believe I'm wrong,
>> then why not prove me wrong with some hard facts and data?
>>
> I already did.
> I showed that there exists a zero context switch cost MIPS processor.
No you didn't. The MIPS core can switch between threads on every cycle, but that doesn't imply zero cost context switch on an interrupt.
> You have not shown that there exist zero cost interrupts.
There is no such thing as a zero-cost interrupt. There are a few CPUs that can respond extremely quickly (eg. Transputer, Forth chips). However there is a tradeoff between the need for fast execution of normal code and fast interrupt response time.
> If we go back to the example.
>
> You have a fixed clock.
> This is used by a number of SPI masters to provide data to your chip.
> Your chip implements SPI slaves and each SPI slave should run
> in a separate task/thread or whatever.
> The communication on each SPI slave channel is totally different
> and should be developed by two teams which do not communicate
> with each other and are not aware of each other.
> Once per byte, the SPI data is written to memory and
> an event flag register private to the thread/interrupt is written.
>
> They are aware of the execution environment, which in the interrupt case
> is the RTOS and how interrupts are handled.
>
> Using one multithreaded and one interrupt processor, with frequency
> scaled so the top level of MIPS is equivalent, show that you can
> implement the SPI slave.
I've already described 2 ways of doing it, reread my old posts. If you think it is not possible, please explain why exactly you think that, then I'll explain the fallacy in your argument.
>> The T1 has tiny caches and stalls on a cache miss unlike any other
>> high-end out-of-order CPU, so they require more threads to keep going
>> if one thread stalls. It is also designed for highly multithreaded workloads,
>> so having more thread contexts means fewer context switches in software,
>> which can be a big win on workloads running on UNIX/Windows (realtime
>> OSes are far better at these things).
>
> It is the other way around. *Because* you have many threads you CAN
> stall a thread on a cache miss, without affecting the total throughput
> of the CPU.
For the same amount of hardware, more threads means less space for caches, so more cache misses. More cache misses means you need more threads. Typical chicken and egg situation...
> It is very likely that the T1 shoves more instructions
> per clock cycle than a "high end, branch prediction, out of order" single
> or dual thread CPU.
Actually T1 benchmarks are very disappointing: with twice the number of cores and 8 times the number of threads the T1 does not even get close to Opteron or Woodcrest on heavily multithreaded benchmarks...

It doesn't mean the whole idea is bad, I think the next generation will do much better (and so will AMD/Intel). However claiming that an in-order multithreaded CPU will easily outperform an out-of-order CPU on total work done is total rubbish.
>>> I do not think that they are limited by Intel's vision...
>>> Also I pointed you at the new MIPS Multithreading core.
>>> They certainly do not agree with you!
>>
>> If you do not understand the differences between cores like Itanium-2,
>> Pentium-4, Nehalem, Power5, Power6 (all 2-way multithreaded),
>> and cores like the T1, MIPS34K and Ubicom (8+ -way threaded),
>> then you're not the expert on multithreading you claim to be.
>>
> You seem to want to slip into a discussion of which type
> of CPU will exhibit the highest MIPS rate for a single thread.
> That is trying to force open an already open door.
No, I wasn't talking about fast single thread performance. My point is that it is a fallacy to think that adding more and more threads is always better. Like so many other things, returns diminish while costs increase. I claim it would be a waste to add more threads on an out-of-order core (max frequency would go down, more cache needed to reclaim performance loss, so not cost effective).

Wilco
Reply by Anton Erasmus, April 6, 2007
On Wed, 04 Apr 2007 23:49:48 GMT, "Wilco Dijkstra"
<Wilco_dot_Dijkstra@ntlworld.com> wrote:

> >"Jim Granville" <no.spam@designtools.maps.co.nz> wrote in message >news:460af1e0$1@clear.net.nz... > >> On the subject of Multiple cores, and multiple threads, news today >> shows this is advancing quite quickly. Intel does not seem to >> think it is a 'waste of die area'..... > >If you read what I wrote then you'd know that on a high end CPU it >takes far less area than on a low end CPU. However Intel must still >think it is a waste of die area, otherwise all their CPUs would have it... > >It is required now as 8 cores on a single chip use so much >bandwidth that most cores are waiting for external memory most >of the time (despite the huge L2 and L3 caches). Switching to a >different thread on a cache miss makes sense in this case. > >> Eight cores and 16 threads (probably they mean per-core?) is impressive >> for what sound like fairly mainstream cores. > >It clearly says 2 threads per core. Any more would be a waste.
The IP3000 from Ubicom supports 8 threads in hardware. Their solution seems to me to be a very good one for multithreading in hardware, where one needs deterministic response from all threads. It looks like they essentially switch between instruction streams in hardware such that, from a software point of view, each thread runs as if it is the only thread, but on a CPU with only a percentage of the total speed.

Regards
  Anton Erasmus
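[A minimal software model of such barrel scheduling, assuming a round-robin issue policy and invented names; the IP3000's real issue logic is of course hardware. Each context gets exactly one slot every N_THREADS cycles, which is what makes per-thread timing deterministic.]

#include <stdint.h>

#define N_THREADS 8

/* One program counter per hardware context; register files would
   be replicated the same way.                                     */
static uint32_t pc[N_THREADS];

extern void issue_one(int thread, uint32_t *pc);  /* assumed issue hook */

/* Called once per clock: issue from the next context, round-robin,
   so each thread sees a CPU running at clock / N_THREADS.          */
void barrel_cycle(void)
{
    static int cur = 0;
    issue_one(cur, &pc[cur]);
    cur = (cur + 1) % N_THREADS;
}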
Reply by Ulf Samuelsson, April 6, 2007
"Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> skrev i meddelandet 
news:e8eRh.2250$gr2.1244@newsfe4-gui.ntli.net...
> > "Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message > news:ev2i1h$qhh$1@aioe.org... > >> Multithreading on a high end general purpose CPU gives problem on their >> own. >> Especially with cache trashing. > > Absolutely. The "solution" is to add more cache...
No, the solution is to have more associativity in the cache. Having 4GB of direct mapped cache will not help you when two threads start using the same cache line.
>> With an embedded core where you use tightly coupled high bandwidth memory >> for most of the threads you do not have that problem > > Same solution: more fast on-chip memory.
If you want to solve the problem for general-purpose symmetric multiprocessing by putting the application memory on the chip, you are going to run into significant problems. You are beginning to get out of touch with reality, my dear friend.
>> I think it is eminently useful for asymmetric multiprocessing where
>> you have some dedicated tasks to do which are best implemented
>> in a separate CPU to avoid real time response conflicts and can
>> be implemented in a low end 32 bitter.
>
> I'm not quite sure what you're saying here. Are you advocating
> asymmetric multiprocessing or asymmetric multithreading?
>
I am saying that it is cheaper to use asymmetric multithreading than asymmetric multiprocessing.
>> I think you need to stop trying to explain why a single CPU
>> is better than a multithreaded CPU, because no one is
>> using a single CPU for implementing two simultaneously
>> operating software MACs.
>
> First of all, you're the one that claims one CPU is better than 2...
> I believe 2 CPUs is better in many cases - multicore is the future.
> However if you do move to a single (faster) CPU then it doesn't
> make much difference in terms of realtime response whether that
> CPU is multithreaded or not. You seem to believe that threads are
> somehow much better than interrupts - but as I've shown they are
> equivalent concepts.
In order for interrupts to be equivalent to multithreading, where you can select a new instruction to execute from a different interrupt every clock cycle, you have to add additional constraints to your "interrupt" system.

You have to have multiple register files and multiple program counters in the system. You have to add additional hardware to dynamically raise/lower priorities in order to distribute instructions among the different interrupts. Your "interrupt" driven system is likely to be mistaken for a multithreading system.

Your way of discussion is way off, you ignore ALL arguments and requests to prove your point, in favour of continued rambling...

You need to show that the given example (multiple SPI slaves) can be handled equally well by an *existing* interrupt driven system as well as how it can be handled by an *existing* multithreaded system like the zero context switch cost MIPS processor. The burden of proof is now on your shoulders; can you concentrate on that instead of rambling?
>
>> If you continue, that just proves that you are either ignorant or not
>> listening
>
> That kind of response is not helping your case. If you believe I'm wrong,
> then why not prove me wrong with some hard facts and data?
>
I already did. I showed that there exists a zero context switch cost MIPS processor. You have not shown that there exist zero cost interrupts.

If we go back to the example:

You have a fixed clock. This is used by a number of SPI masters to provide data to your chip. Your chip implements SPI slaves, and each SPI slave should run in a separate task/thread or whatever. The communication on each SPI slave channel is totally different and should be developed by two teams which do not communicate with each other and are not aware of each other. Once per byte, the SPI data is written to memory and an event flag register private to the thread/interrupt is written.

They are aware of the execution environment, which in the interrupt case is the RTOS and how interrupts are handled.

Using one multithreaded and one interrupt processor, with frequency scaled so the top level of MIPS is equivalent, show that you can implement the SPI slave.
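[A minimal sketch of the per-slave thread body for the multithreaded case; the pin and mailbox names are invented, and a real slave would also watch chip select and the SPI mode. It samples on the rising edge, and once per byte stores the data and raises the thread-private event flag, as the example specifies.]

#include <stdint.h>

/* Hypothetical memory-mapped pins and mailbox, one set per slave. */
extern volatile uint8_t SPI_CLK;      /* clock input pin             */
extern volatile uint8_t SPI_MOSI;     /* data input pin              */
extern volatile uint8_t rx_byte;      /* byte written to memory      */
extern volatile uint8_t rx_event;     /* flag private to this thread */

/* One hardware thread per SPI slave. */
void spi_slave_thread(void)
{
    for (;;) {
        uint8_t byte = 0;
        for (int bit = 0; bit < 8; bit++) {
            while (!SPI_CLK) ;                       /* wait for rising edge  */
            byte = (uint8_t)((byte << 1) | (SPI_MOSI & 1u));
            while (SPI_CLK) ;                        /* wait for falling edge */
        }
        rx_byte  = byte;    /* once per byte: store the data...      */
        rx_event = 1;       /* ...and raise the private event flag   */
    }
}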
>> The issue is replacing multiple CPUs/memory subsystems
>> with a single multithreaded CPU addressing a memory subsystem
>> consisting of internal TCM memory, internal loosely coupled
>> memory (flash?) and external memory.
>
> Most realtime CPUs have some form of fast internal memory,
> this is not relevant to multithreading.
>
>>>> Eight cores and 16 threads (probably they mean per-core?) is impressive
>>>> for what sound like fairly mainstream cores.
>>>
>>> It clearly says 2 threads per core. Any more would be a waste.
>>>
>>
>> Look at Sun and UltraSparc T1, they certainly do not see the boundaries
>> that you see.
>
> The T1 has tiny caches and stalls on a cache miss unlike any other
> high-end out-of-order CPU, so they require more threads to keep going
> if one thread stalls. It is also designed for highly multithreaded workloads,
> so having more thread contexts means fewer context switches in software,
> which can be a big win on workloads running on UNIX/Windows (realtime
> OSes are far better at these things).
It is the other way around. *Because* you have many threads you CAN stall a thread on a cache miss, without affecting the total throughput of the CPU. It is very likely that the T1 shoves more instructions per clock cycle than a "high end, branch prediction, out of order" single or dual thread CPU.
>
>> I do not think that they are limited by Intel's vision...
>> Also I pointed you at the new MIPS Multithreading core.
>> They certainly do not agree with you!
>
> If you do not understand the differences between cores like Itanium-2,
> Pentium-4, Nehalem, Power5, Power6 (all 2-way multithreaded),
> and cores like the T1, MIPS34K and Ubicom (8+ -way threaded),
> then you're not the expert on multithreading you claim to be.
>
You seem to want to slip into a discussion of which type of CPU will exhibit the highest MIPS rate for a single thread. That is trying to force open an already open door.
> Wilco >
--
Best Regards,
Ulf Samuelsson

This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply by Wilco Dijkstra, April 5, 2007
"Ulf Samuelsson" <ulf@a-t-m-e-l.com> wrote in message news:ev2i1h$qhh$1@aioe.org...

> Multithreading on a high end general purpose CPU gives problems of its own.
> Especially with cache thrashing.
Absolutely. The "solution" is to add more cache...
> With an embedded core where you use tightly coupled high bandwidth memory > for most of the threads you do not have that problem
Same solution: more fast on-chip memory.
> I think it is eminently useful for asymmetric multiprocessing where
> you have some dedicated tasks to do which are best implemented
> in a separate CPU to avoid real time response conflicts and can
> be implemented in a low end 32 bitter.
I'm not quite sure what you're saying here. Are you advocating asymmetric multiprocessing or asymmetric multithreading?
> I think you need to stop trying to explain why a single CPU
> is better than a multithreaded CPU, because no one is
> using a single CPU for implementing two simultaneously
> operating software MACs.
First of all, you're the one that claims one CPU is better than 2... I believe 2 CPUs is better in many cases - multicore is the future. However if you do move to a single (faster) CPU then it doesn't make much difference in terms of realtime response whether that CPU is multithreaded or not. You seem to believe that threads are somehow much better than interrupts - but as I've shown they are equivalent concepts.
> If you continue, that just proves that you are either ignorant or not listening
That kind of response is not helping your case. If you believe I'm wrong, then why not prove me wrong with some hard facts and data?
> The issue is replacing multiple CPUs/memory subsystems
> with a single multithreaded CPU addressing a memory subsystem
> consisting of internal TCM memory, internal loosely coupled
> memory (flash?) and external memory.
Most realtime CPUs have some form of fast internal memory, this is not relevant to multithreading.
>>> Eight cores and 16 threads (probably they mean per-core?) is impressive
>>> for what sound like fairly mainstream cores.
>>
>> It clearly says 2 threads per core. Any more would be a waste.
>>
> Look at Sun and UltraSparc T1, they certainly do not see the boundaries
> that you see.
The T1 has tiny caches and stalls on a cache miss unlike any other high-end out-of-order CPU, so they require more threads to keep going if one thread stalls. It is also designed for highly multithreaded workloads, so having more thread contexts means fewer context switches in software, which can be a big win on workloads running on UNIX/Windows (realtime OSes are far better at these things).
> I do not think that they are limited by Intel's vision...
> Also I pointed you at the new MIPS Multithreading core.
> They certainly do not agree with you!
If you do not understand the differences between cores like Itanium-2, Pentium-4, Nehalem, Power5, Power6 (all 2-way multithreaded), and cores like the T1, MIPS34K and Ubicom (8+ -way threaded), then you're not the expert on multithreading you claim to be.

Wilco
Reply by Ulf Samuelsson, April 5, 2007
>> On the subject of Multiple cores, and multiple threads, news today
>> shows this is advancing quite quickly. Intel does not seem to
>> think it is a 'waste of die area'.....
>
> If you read what I wrote then you'd know that on a high end CPU it
> takes far less area than on a low end CPU. However Intel must still
> think it is a waste of die area, otherwise all their CPUs would have it...
>
Multithreading on a high end general purpose CPU gives problems of its own. Especially with cache thrashing. With an embedded core where you use tightly coupled high bandwidth memory for most of the threads you do not have that problem. Note I am not advocating symmetric multiprocessing.

I think it is eminently useful for asymmetric multiprocessing where you have some dedicated tasks to do which are best implemented in a separate CPU to avoid real time response conflicts and can be implemented in a low end 32 bitter.

I think you need to stop trying to explain why a single CPU is better than a multithreaded CPU, because no one is using a single CPU for implementing two simultaneously operating software MACs. If you continue, that just proves that you are either ignorant or not listening.

The issue is replacing multiple CPUs/memory subsystems with a single multithreaded CPU addressing a memory subsystem consisting of internal TCM memory, internal loosely coupled memory (flash?) and external memory.
> It is required now as 8 cores on a single chip use so much > bandwidth that most cores are waiting for external memory most > of the time (despite the huge L2 and L3 caches). Switching to a > different thread on a cache miss makes sense in this case. > >> Eight cores and 16 threads (probably they mean per-core?) is impressive >> for what sound like fairly mainstream cores. > > It clearly says 2 threads per core. Any more would be a waste. >
Look at Sun and UltraSparc T1, they certainly do not see the boundaries that you see. I do not think that they are limited by Intel's vision...

Also I pointed you at the new MIPS Multithreading core. They certainly do not agree with you!
> Wilco
--
Best Regards,
Ulf Samuelsson

This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB