
Interrupt latency

Started by Unknown March 16, 2008
> Cache latency?
That's the first thing that jumps to mind, but... over 2000 cycles? OTOH, could it be that the L2 cache on the Celeron (or an on-board L3?) is organized so badly that a read-modify-write to memory *always* causes a cache miss? Interesting, I ought to test what happens if there is an off-chip cache and whether I can disable it. D
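For reference, one way to run that "disable the cache" experiment is to rerun the measurement with processor caching turned off via CR0.CD. A minimal sketch in NASM syntax, assuming ring-0 code and ignoring MTRR/PAT interactions; not from the posted kernel:

```nasm
        ; Disable caching: CR0.CD=1 with CR0.NW=0 is the architecturally
        ; valid "no-cache" combination on IA-32.
        mov     eax, cr0
        or      eax, 1 << 30        ; set CD (cache disable)
        and     eax, ~(1 << 29)     ; clear NW (not write-through)
        mov     cr0, eax
        wbinvd                      ; write back and invalidate all lines

        ; ... rerun the latency measurement here ...

        ; Re-enable caching afterwards
        mov     eax, cr0
        and     eax, ~(1 << 30)
        mov     cr0, eax
```

With CD=1 every access goes to memory, so if the 2300-cycle penalty is cache-related, the fast and slow paths should converge (everything becomes slow); if the penalty stays isolated to that one instruction, the cache hypothesis looks weaker.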
In article <7366de24-bc1c-47f1-9fc1-6de7d66d6e5a@s37g2000prg.googlegroups.com>,
Stargazer   <spamtrap@crayne.org> wrote:
>Hi,
>
>I am writing my own real-time kernel for x86. Now I face something
>really strange (or may be rather it's not; it has been some time since
>I was in the details of x86 microarchitecture).
>
>I measured CPU clocks elapsed between the first assembly instruction
>executed at interrupt's entry point in IDT and beginning of the C code
>of user-defined interrupt handler and the result was a big
>surprise :-) It took about 2500 cycles despite that I have only a
>handful of assembly instructions before a call to user-supplied IRQ
>handler.
>
>A little more testing showed that almost all cycles (2300+) were spent
>at access to a global variable (via ds:[] addressing).
Is this a SINGLE memory access or several/many accesses?
>Nothing that
>accesses stack memory (push, pop, call, mov) makes a noticeable
>difference. Does anybody have an idea why this happens? I test on
>Celeron 2.8G in protected mode set up for flat model with paging
>disabled.
>
>I can post the code of the interrupt's entry point (until the C entry
>point is called), but it's rather trivial and not long.
Please feel free to post the code so there doesn't have to be any unnecessary speculation. Patrick
On Mar 17, 6:00 am, Cyril Novikov <spamt...@crayne.org> wrote:
> Stargazer wrote:
> > Hi,
> >
> > I am writing my own real-time kernel for x86. [...]
> >
> > A little more testing showed that almost all cycles (2300+) were spent
> > at access to a global variable (via ds:[] addressing). Nothing that
> > accesses stack memory (push, pop, call, mov) makes a noticeable
> > difference. Does anybody have an idea why this happens? I test on
> > Celeron 2.8G in protected mode set up for flat model with paging
> > disabled.
>
> That's unlikely. 2300 cycles is about 1us, no memory is that slow. That
> looks more like ISA bus access time. Do you have any I/O instructions or
> MMIO accesses there, by any chance? How do you measure cycles, did you
> remember to use synchronizing instructions with RDTSC?
I use RDTSC twice and display the difference between the two readings. I didn't serialize RDTSC, but what inaccuracy would that cause? I would accept that it explains the +/- 10 cycle jitter in the measurements, but not 2000.
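For reference, the usual way to serialize RDTSC on this class of CPU is to execute CPUID immediately before it, which keeps earlier instructions from being reordered past the timestamp read. A sketch in NASM syntax; `_ts_start` and `_ts_end` are made-up labels, not from the posted code:

```nasm
        push    ebx                 ; cpuid clobbers eax, ebx, ecx, edx
        xor     eax, eax
        cpuid                       ; serialize: drain preceding instructions
        rdtsc                       ; edx:eax = start timestamp
        mov     [_ts_start], eax
        mov     [_ts_start+4], edx

        ; ... code under test ...

        xor     eax, eax
        cpuid                       ; serialize again before the end read
        rdtsc
        mov     [_ts_end], eax
        mov     [_ts_end+4], edx
        pop     ebx
```

As the poster suspects, the lack of serialization typically accounts for tens of cycles of jitter at most; it would not explain a stable 2000+ cycle difference.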
> > I can post the code of the interrupt's entry point (until the C entry
> > point is called), but it's rather trivial and not long.
>
> That would be helpful.
The code is below. Only the relevant snippet is posted; I hid most of the irrelevant code (e.g. there is some logic following it that loops through a possible maximum of 16 chained interrupts, but I currently don't chain anything). "TEST_INT_RESPONSE" is defined above the code, and "_int_received" is a public label that marks two doublewords. The "; Ts from here..." comments are placed where I inserted the three-instruction sequence (rdtsc and 2 moves) and show the TS difference that I got until the rdtsc in inline assembly in the C code. It appears that the "inc dword [_running_irq]" accounts for all the mess.

BTW, I now notice that there is not only a read through ds:[] (bt), but also two writes to ds:[] following the rdtsc that I use to store the time stamp for further comparison, and neither causes any anomaly. So it just strengthens my suspicion that the RMW instruction has some strange effect on more distant caches. Does anybody have an idea?

Thanks, D

------------ start of code -----------------
handle_int:
    push    eax
    push    ebx

%IFDEF TEST_INT_RESPONSE
    ;
    ; Test latency for interrupt.
    ; This code measures latency without itself. With the measurement
    ; code the latency is slightly higher
    ;
    push    edx
    rdtsc
    mov     [_int_received], eax
    mov     [_int_received+4], edx

    ; Dummy instructions that take about the same time as instructions above
    push    10h
    jmp     dummy1
dummy1:
    push    eax
    push    ebx

    ; Enough, go ahead
    add     esp, 12
    pop     edx
    ;
    ; End of test latency for interrupt
    ;
%ENDIF

    mov     ebx, [esp+8]

    ;
    ; Acknowledge interrupt with PIC(s).
    ; After calling callbacks, to enable nesting of interrupt handlers
    ; according to priorities
    ; TODO: check rotation priorities of 8259 and other options
    ;
    mov     al, CMD_EOI
    out     PIC_MASTER, al
    cmp     ebx, 8
    jb      fin
    out     PIC_SLAVE, al
fin:
    ;
    ; Ts from here to int handler is 2340-2350 cycles
    ;
    bt      [_int_callbacks], ebx
    jnc     return_int
    ;
    ; Ts from here to int handler is 2340-2350 cycles
    ;
    ; On stack saved already: EAX, EBX
    ; Don't trust C code, save all regs (except for EBP, ESP)
    push    ecx
    push    edx
    push    esi
    push    edi
    ;
    ; Ts from here to int handler is ~2340 (cycles)
    ;
%IF 0
    push    ebx                 ; int_no
    call    _call_int_callbacks
    add     esp, 4
%ELSE
    ;
    ; Implement _call_int_callbacks in assembly
    ;
;;;;; Just to test where cycles go
;   push    eax
;   push    edx
;   rdtsc
;   mov     [_int_received], eax
;   mov     [_int_received+4], edx
;   pop     edx
;   pop     eax
;;;;;
    inc     dword [_running_irq]
    ;
    ; Ts from here to int handler is 105-115 (cycles)
    ;
    mov     eax, 10             ; MAX_IRQ_HANDLERS
    mov     ecx, eax
    ;
    ; Ts from here to int handler is 105-115 (cycles)
    ;
    mul     ebx
    lea     eax, [eax*4]
    ;
    ; Ts from here to int handler is 90-105 (cycles)
    ;
next_int_handler:
    push    eax
    ;
    ; Ts from here to int handler is 90-105 (cycles)
    ;
    call    dword [_isr_proc+eax]
[...]
%ENDIF
------------ end of code --------------
On Mar 16, 8:00 pm, Cyril Novikov <spamt...@crayne.org> wrote:
> That's unlikely. 2300 cycles is about 1us, no memory is that slow. That
> looks more like ISA bus access time. Do you have any I/O instructions or
> MMIO accesses there, by any chance? How do you measure cycles, did you
> remember to use synchronizing instructions with RDTSC?
Actually, PCI reads tend to take about 1us, and they are generally implemented as memory reads. Are you reading an interrupt status register to check for the interrupt condition, and is that the memory read that eats a microsecond? What is this memory location?
Stargazer wrote:

>    ; Acknowledge interupt with PIC(s).
>    ; After calling callbacks, to enable nesting of interrupt handlers
>    ; according to priorities
>    ; TODO: check rotation priorities of 8259 and other options
>    ;
>    mov     al, CMD_EOI
>    out     PIC_MASTER, al
>    cmp     ebx, 8
>    jb      fin
>    out     PIC_SLAVE, al
These 2 OUT's are what takes 2000+ cycles. The reason you see it elsewhere is most likely a bug in your measurement code.
Stargazer wrote:
> I measured CPU clocks elapsed between the first assembly instruction
> executed at interrupt's entry point in IDT and beginning of the C code
> of user-defined interrupt handler and the result was a big
> surprise :-) It took about 2500 cycles despite that I have only a
> handful of assembly instructions before a call to user-supplied IRQ
> handler.
Is the Celeron 2.8G related to the Pentium 4? I'm not clued up on processors of the last few years. Interrupts (int and iret) took about 1900 to 2000 cycles on the P4 processors for some reason. I know it was significantly higher than on the earlier processors, which took a few hundred cycles. I'm not sure how many cycles each instruction takes on its own, though. I think someone, maybe Piotr, ran some benchmarks of int/iret and sysenter/sysexit on the P4 a few years ago and posted them in a.o.d. -- Marv
On Mar 17, 12:12 pm, "Wolfgang Kern" <spamt...@crayne.org> wrote:
> Stargazer wrote:
> > ...
> >
> > I measured CPU clocks elapsed between the first assembly instruction
> > executed at interrupt's entry point in IDT and beginning of the C code
> > of user-defined interrupt handler and the result was a big
> > surprise :-) It took about 2500 cycles despite that I have only a
> > handful of assembly instructions before a call to user-supplied IRQ
> > handler.
>
> A naked IRQ (just EOI and IRET) takes about 200 cycles, if measured in
> an unprotected environment (ie: my OS where all code run with PL0).
>
> One big cycle loss is in the PL-transition, especially if hardware
> task-switches are in use.
I use the "classical" monolithic embedded OS model: both the OS and the application run at PL0, no PL switches, no address spaces (CR0.PG=0), and TSSs are not used. 200-300 cycles is what I get without that "inc" instruction. D.
On Mar 17, 3:08 pm, Patrick Klos <spamt...@crayne.org> wrote:
> In article <7366de24-bc1c-47f1-9fc1-6de7d66d6...@s37g2000prg.googlegroups.com>,
> Stargazer <spamt...@crayne.org> wrote:
> >Hi,
> >
> >I am writing my own real-time kernel for x86. [...]
> >
> >A little more testing showed that almost all cycles (2300+) were spent
> >at access to a global variable (via ds:[] addressing).
>
> Is this a SINGLE memory access or several/many accesses?
It's a single read-modify-write instruction (see "inc [xxx]" in the posted code). I don't know how you would define it; I just hope that there isn't some fancy incrementer in the memory controller :-) Normal single memory accesses, which I also do (see "bt" and the 2 "mov"s after "rdtsc"), don't cause anything unexpected. D
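One quick differential test for this (a sketch, not from the thread) is to split the RMW instruction into an explicit load/modify/store sequence, which touches the same memory but issues a separate read transaction and write transaction instead of one read-modify-write:

```nasm
        ; Equivalent to "inc dword [_running_irq]", but as three
        ; instructions: separate load, register increment, and store.
        ; Safe here since eax is already saved on entry to handle_int.
        mov     eax, [_running_irq]
        inc     eax
        mov     [_running_irq], eax
```

If the split version runs in the expected handful of cycles while the single `inc dword [_running_irq]` costs 2300+, the penalty is tied to the RMW form itself; if both are equally slow, something about that particular address (its memory type, for instance) would be the more likely suspect.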
On Mar 17, 8:44 pm, Cyril Novikov <spamt...@crayne.org> wrote:
> Stargazer wrote:
> >    ; Acknowledge interupt with PIC(s).
> >    [...]
> >    mov     al, CMD_EOI
> >    out     PIC_MASTER, al
> >    cmp     ebx, 8
> >    jb      fin
> >    out     PIC_SLAVE, al
>
> These 2 OUT's are what takes 2000+ cycles. The reason you see it elsewhere is
> most likely a bug in your measurement code.
Why would it take 2000+ cycles? The interrupt controller is integrated into the chipset, which provides pretty fast access. Anyway, your assertion is not true: if I just comment out the "inc dword [_running_irq]", I get a total latency of about 300 cycles. D
On Mar 17, 10:06 pm, "Marven Lee" <spamt...@crayne.org> wrote:
> Stargazer wrote:
> > I measured CPU clocks elapsed between the first assembly instruction
> > executed at interrupt's entry point in IDT and beginning of the C code
> > of user-defined interrupt handler and the result was a big
> > surprise :-) It took about 2500 cycles despite that I have only a
> > handful of assembly instructions before a call to user-supplied IRQ
> > handler.
>
> Is the Celeron 2.8G related to the Pentium 4? I'm not clued up on
> processors of the last few years.
Yes, it's a Celeron from the Pentium 4 family. However, I don't include the processor's interrupt entry and IRET in the timing. For now I want my interrupt entry code to introduce minimal additional latency, not to deal with matters that I can't change anyway :-) As I said, I know where the problem lies, but I don't understand the reason, so I can't yet eliminate it. D