Interrupt latency

Started by ●March 16, 2008

Hi,

I am writing my own real-time kernel for x86. Now I face something
really strange (or maybe it isn't; it has been some time since I was
into the details of x86 microarchitecture).

I measured the CPU clocks elapsed between the first assembly instruction
executed at the interrupt's entry point in the IDT and the beginning of
the C code of the user-defined interrupt handler, and the result was a
big surprise :-) It took about 2500 cycles, despite my having only a
handful of assembly instructions before the call to the user-supplied
IRQ handler.

A little more testing showed that almost all of the cycles (2300+) were
spent on an access to a global variable (via ds:[] addressing). Nothing
that accesses stack memory (push, pop, call, mov) makes a noticeable
difference. Does anybody have an idea why this happens? I test on a
Celeron 2.8G in protected mode, set up for a flat model with paging
disabled.

I can post the code of the interrupt's entry point (up to where the C
entry point is called), but it's rather trivial and not long.

Thanks,
D
Reply by ●March 16, 2008

On Sun, 16 Mar 2008 16:02:18 -0700, Stargazer wrote:

> A little more testing showed that almost all cycles (2300+) were spent
> at access to a global variable (via ds:[] addressing).

Cache latency?

--
Tim Wescott
Control systems and communications consulting
http://www.wescottdesign.com

Need to learn how to apply control theory in your embedded system?
"Applied Control Theory for Embedded Systems" by Tim Wescott
Elsevier/Newnes, http://www.wescottdesign.com/actfes/actfes.html
Reply by ●March 16, 2008

On Mar 16, 4:02 pm, Stargazer <spamt...@crayne.org> wrote:

> A little more testing showed that almost all cycles (2300+) were spent
> at access to a global variable (via ds:[] addressing). Nothing that
> accesses stack memory (push, pop, call, mov) makes a noticeable
> difference.

What are the min, max and average cycle counts (you need to repeat the
measurement many times)? What are the numbers on other PCs?

I wonder if it's SMIs. On my Dell Latitude D610 notebook an SMI (or a
short burst thereof) may take up to ~240K cycles, which is ~150
microseconds at 1.6 GHz; on the old Compaq Armada 7800 notebook it's
only 12K cycles, or ~40 microseconds at 300 MHz. Hardware bugfixes and
control are moving into the CPU. :(

Alex
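[Editor's note: repeating the measurement and reporting min/max/average,
as suggested above, can be sketched in plain C. The sampling itself is
hardware-specific and left out; `summarize` and `cycle_stats` are names
invented for this sketch, not from the original posts.]

```c
#include <stddef.h>
#include <stdint.h>

/* Summary of repeated cycle-count measurements: the minimum is usually
   closest to the true cost; SMIs and other outliers show up in the max. */
typedef struct { uint64_t min, max, avg; } cycle_stats;

static cycle_stats summarize(const uint64_t *samples, size_t n)
{
    cycle_stats s = { samples[0], samples[0], 0 };
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < s.min) s.min = samples[i];
        if (samples[i] > s.max) s.max = samples[i];
        sum += samples[i];
    }
    s.avg = sum / n;   /* integer average is fine at this scale */
    return s;
}
```

A large max-min spread would point at an asynchronous cause such as an
SMI; a tight spread (as the OP later reports) points at the code path
itself.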
Reply by ●March 16, 2008

Stargazer wrote:

> I measured CPU clocks elapsed between the first assembly instruction
> executed at interrupt's entry point in IDT and beginning of the C code
> of user-defined interrupt handler and the result was a big
> surprise :-) It took about 2500 cycles ...

I cannot help you, but just want to thank you for the educating (for me)
post. It demonstrates I have been right to stay away from x86 (I have
considered it 2 or 3 times, every 5 years or so, and have not done so
for the last 8 years IIRC). The latency is probably compiler generated,
of course, but this does not make things any better :-).

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
Reply by ●March 16, 2008

On Mar 16, 7:02 pm, Stargazer <spamt...@crayne.org> wrote:

> difference. Does anybody have an idea why this happens? I test on
> Celeron 2.8G in protected mode set up for flat model with paging

God knows. Cache line fill, maybe. Bus contention with a shared-memory
graphics adapter. Any one of a million things. x86 in V86 mode is a
nondeterministic architecture; design accordingly.
Reply by ●March 17, 2008

Stargazer wrote:

> A little more testing showed that almost all cycles (2300+) were spent
> at access to a global variable (via ds:[] addressing). Nothing that
> accesses stack memory (push, pop, call, mov) makes a noticeable
> difference. Does anybody have an idea why this happens? I test on
> Celeron 2.8G in protected mode set up for flat model with paging
> disabled.

That's unlikely. 2300 cycles is about 1 us; no memory is that slow. That
looks more like ISA bus access time. Do you have any I/O instructions or
MMIO accesses there, by any chance? How do you measure cycles -- did you
remember to use serializing instructions with RDTSC?

> I can post the code of the interrupt's entry point (until the C entry
> point is called), but it's rather trivial and not long.

That would be helpful.
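[Editor's note: a serialized RDTSC read on x86 is conventionally done by
pairing the counter read with CPUID, which is a serializing instruction.
This is a minimal GCC/Clang inline-asm sketch for x86 only; the function
name is invented for illustration.]

```c
#include <stdint.h>

/* Read the time-stamp counter after a serializing CPUID, so that all
   earlier instructions have retired before the counter is sampled.
   Without this, out-of-order execution can attribute cycles to the
   wrong instruction. GCC/Clang inline asm, x86/x86-64 only. */
static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ volatile("cpuid\n\t"      /* serialize the pipeline */
                     "rdtsc"           /* EDX:EAX = time-stamp counter */
                     : "=a"(lo), "=d"(hi)
                     : "a"(0)
                     : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```

Bracketing the code under test between two such reads, and subtracting
the cost of an empty bracket measured the same way, avoids charging the
measured region for instructions that were still in flight.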
Reply by ●March 17, 2008

Stargazer wrote:

> A little more testing showed that almost all cycles (2300+) were spent
> at access to a global variable (via ds:[] addressing). Nothing that
> accesses stack memory (push, pop, call, mov) makes a noticeable
> difference.

Something is not right. The biggest part of the latency is created by
the SDRAM Tpre + Tras + Trcd + Tcl + Tburst. The ds:[] access also
causes a shadow descriptor miss. So in the worst case it translates to
two SDRAM bursts to flush the dirty cache lines, and another two bursts
to reload. I would expect a delay on the order of hundreds of CPU
cycles.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
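[Editor's note: as a sanity check on that estimate, here is the same
worst-case sum worked through with illustrative timings. Every number
below -- the SDRAM parameters, the burst count, and the 14x core-to-bus
multiplier -- is an assumption picked to look typical for the era, not a
measured value.]

```c
#include <stdint.h>

/* Worst case for the scenario above: four SDRAM bursts (two write-backs
   of dirty lines, two line reloads), each paying the full
   Tpre + Tras + Trcd + Tcl + Tburst sequence. All timing values are
   assumed, illustrative numbers. */
enum {
    T_PRE = 3, T_RAS = 3, T_RCD = 3, T_CL = 3,  /* memory-bus cycles */
    T_BURST = 4,                                /* 4-beat burst */
    BURSTS = 4,                                 /* 2 flush + 2 reload */
    CPU_MULT = 14    /* e.g. 2.8 GHz core on a 200 MHz memory bus */
};

static uint32_t worst_case_cpu_cycles(void)
{
    uint32_t per_burst = T_PRE + T_RAS + T_RCD + T_CL + T_BURST; /* 16 */
    return per_burst * BURSTS * CPU_MULT;   /* 16 * 4 * 14 = 896 */
}
```

896 CPU cycles is indeed "hundreds" -- well short of the 2300+ observed,
which is what makes the measurement look suspicious.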
Reply by ●March 17, 2008

Stargazer wrote:

> I measured CPU clocks elapsed between the first assembly instruction
> executed at interrupt's entry point in IDT and beginning of the C code
> of user-defined interrupt handler and the result was a big
> surprise :-) It took about 2500 cycles ...

A naked IRQ (just EOI and IRET) takes about 200 cycles, if measured in
an unprotected environment (i.e. my OS, where all code runs at PL0).
One big cycle loss is in the PL transition, especially if hardware task
switches are in use.

> A little more testing showed that almost all cycles (2300+) were spent
> at access to a global variable (via ds:[] addressing).

If there is no WBINVD instruction (~2000 cycles) in your code, I can
only guess what may happen here if user code runs at PL=3:

  IRQ           : ->PL0
  user hook     : PL0->PL3
  global access : PL3->PL0->PL3
  end of hook   : PL3->PL0
  IRET          : PL0->PL3

__
wolfgang
Reply by ●March 17, 2008

On Mar 17, 2:38 am, "Alexei A. Frounze" <spamt...@crayne.org> wrote:

> What are the min, max and average cycle counts (you need to repeat the
> measurement many times)? What are the numbers on other PCs?

A weird thing is that the difference between min and max is about 10
cycles. That is, the results are fairly accurate and consistent. I
haven't tested on other PCs yet.

> I wonder if it's SMIs. On my Dell Latitude D610 notebook an SMI (or a
> short burst thereof) may take up to ~240K cycles ...

I don't know. It's a single instruction that accounts for over 2000
cycles; I can point to the instruction but don't understand the reason.
It's a read-modify-write (INC ds:[xxx]), and it must have something to
do with the instruction being RMW. Actually I have a BT ds:[xxx] (read)
several instructions before it, which doesn't cause anything abnormal.
It would be weird if an SMI were somehow triggered on each and every
hardware interrupt.

D
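[Editor's note: one way to test the RMW hypothesis is to replace the
single INC ds:[xxx] with an explicit load/modify/store sequence and time
both variants. A user-space sketch of the two forms follows; GCC/Clang
inline asm, x86 only, and `counter` merely stands in for the kernel's
global variable.]

```c
#include <stdint.h>

volatile uint32_t counter;  /* stand-in for the kernel's ds:[xxx] global */

/* Variant 1: a single read-modify-write instruction, like the
   INC ds:[xxx] in the original entry code. */
static void bump_rmw(void)
{
    __asm__ volatile("incl %0" : "+m"(counter));
}

/* Variant 2: the same effect split into load, modify, store. If the
   RMW encoding itself were at fault, timing this variant (e.g. with
   serialized RDTSC brackets) should give a very different result. */
static void bump_split(void)
{
    uint32_t tmp;
    __asm__ volatile("movl %1, %0\n\t"
                     "incl %0\n\t"
                     "movl %0, %1"
                     : "=&r"(tmp), "+m"(counter));
}
```

Both variants leave the counter with the same value; only their timing
behavior is in question.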