Stargazer escribi&#4294967295;:
> ...
 >
> A little more testing showed that almost all cycles (2300+) were spent
> at access to a global variable (via ds:[] addressing). Nothing that
> accesses stack memory (push, pop, call, mov) makes a noticeable
> difference. Does anybody have an idea why this happens? I test on
> Celeron 2.8G in protected mode set up for flat model with paging
> disabled.

Just catching flies...: are you sure that no exception code (trap or 
whatever, it's a long time since I assembled x86 code) is being called 
by your inc instruction? Imagine, for instance, an access violation (bad 
GDT)

On Mar 19, 12:05 pm, Tim Roberts  <spamt...@crayne.org> wrote:
> Stargazer  <spamt...@crayne.org> wrote:
> >On Mar 17, 8:44&#65533;pm, Cyril Novikov  <spamt...@crayne.org> wrote:
> >> Stargazer wrote:
> >> > &#65533; &#65533;;
> >> > &#65533; &#65533;mov &#65533; &#65533; al, CMD_EOI
> >> > &#65533; &#65533;out &#65533; &#65533; PIC_MASTER, al
> >> > &#65533; &#65533;cmp &#65533; &#65533; ebx, 8
> >> > &#65533; &#65533;jb &#65533; &#65533; &#65533;fin
> >> > &#65533; &#65533;out &#65533; &#65533; PIC_SLAVE, al
>
> >> These 2 OUT's are what takes 2000+ cycles. The reason you see it elsewhere is
> >> most likely a bug in your measurement code.
>
> >Why would it take 2000+ cycles? interrupt controller is integrated in
> >a chipset which provides pretty fast access.
>
> That is irrelevant.  Cyril is quite correct.  I/O port accesses take about
> a microsecond, regardless of bus speed or where the port is located.  Part
> of this comes from the need to maintain historical compatibility.
> --
> Tim Roberts, t...@probo.com
> Providenza & Boekelheide, Inc.

Don't know if this helps, but on my 1GHz P3 laptop using just the
following code,

   RDTSC...
   mov al,20h
   out 20h,al
   RDTSC...   produces 450 cycles and

   RDTSC...
   mov al,20h
   out 20h,al
   out 0a0h,al
   RDTSC...   produces 900 cycles.

   So it would appear as 450nS and 900nS respectively. Since your
running at 2.8GHz then 900nS would take 2520 cycles. Serializing
didn't seem to make much difference nor running real mode or flat
32bit protected mode. One interesting note was that looping through 5
or more times and taking the final result was 20 cycles slower than
just a single run. 470 & 920.
   Chapter 13, Input/Output, in the IA-32 Intel Architecture Software
Developer&rsquo;s Manual, Volume 1, seems to indicate that "I/O writes to
control system hardware cause the hardware to be set to its new state
before any other instructions are executed". I'm not a programmer so I
may have misunderstood and it's all just coincidence, hopefully
someone will put me right. Thanks for giving me some code to play
with, I'm sad to say it was fun. :)

On 17 Mar, 13:16, Stargazer  <spamt...@crayne.org> wrote:

...

> > > A little more testing showed that almost all cycles (2300+) were spent
> > > at access to a global variable (via ds:[] addressing).

...

> It appears
> that the "inc dword [_running_irq]" accounts for all the mess.
>
> BTW, now I notice that there is not only read through ds:[] (bt), but
> also two writes to ds:[] following the rtdsc that I use to store time
> stamp for further comparison and neither causes any anomaly. So it
> just strengthens my suspect that RMW instruction has some strange
> effect on more distant caches. Does anybody have an idea?

--- <code snipped> ---

I'm a bit confused as to how the measurements are taken. Why not try
to measure by using the following which is based on

http://cs.smu.ca/~jamuir/rdtscpm1.pdf

cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax

cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax

cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax // Only the last value of subtime is kept
// subtime should now represent the overhead cost of the
// MOV and CPUID instructions


...other instructions...


;Test 1: the single inc instruction

cpuid // Serialize execution
rdtsc // Read time stamp to EAX
mov time_1, eax ;Time for this instruction

inc     dword [_running_irq]    #Taken from your code

cpuid // Serialize again for time-stamp read
rdtsc
sub eax, time_1 // Find the difference
mov time_1, eax


...other instructions...


#Now time_1 minus subtime should give length of test 1

(As you know the above clobbers eax, ebx, ecx, edx each time cpuid is
run so you need to push/pop to protect them if needed.)

If the time is still of the order of 2000 cycles maybe try splitting
your inc instruction to

mov esi, _running_irq
inc esi
mov _running_irq, esi

where esi is used as eax will be trashed. If still as long (unlikely,
for the reasons you mentioned) then you could split the measurement
points.

--
James

Stargazer wrote:

>>>    ; Acknowledge interupt with PIC(s).
>>>    ; After calling callbacks, to enable nesting of interrupt handlers
>>> according to priorities
>>>    ; TODO: check rotation priorities of 8259 and other options
>>>    ;
>>>    mov     al, CMD_EOI
>>>    out     PIC_MASTER, al
>>>    cmp     ebx, 8
>>>    jb      fin
>>>    out     PIC_SLAVE, al
>> These 2 OUT's are what takes 2000+ cycles. The reason you see it elsewhere is
>> most likely a bug in your measurement code.
> 
> Why would it take 2000+ cycles? interrupt controller is integrated in
> a chipset which provides pretty fast access.

Because i8259 is still an LPC (new name for ISA) device working at the 
slow ISA bus speed.

> Anyway, your assertion is
> not true: if I just comment out the "inc dword [_running_irq]", I get
> total latency of about 300 cycles.

This is really strange and intriguing. I assume you've made sure that 
your timing code is correct (i.e. uses synchronizing instructions as 
Intel recommends). Can you try incrementing a different address, does it 
make a difference? An address off SS segment (e.g. inc dword ss:[-4])? 
Replace the increment with a mov?

Stargazer  <spamtrap@crayne.org> wrote:
>On Mar 17, 8:44&#4294967295;pm, Cyril Novikov  <spamt...@crayne.org> wrote:
>> Stargazer wrote:
>> > &#4294967295; &#4294967295;;
>> > &#4294967295; &#4294967295;mov &#4294967295; &#4294967295; al, CMD_EOI
>> > &#4294967295; &#4294967295;out &#4294967295; &#4294967295; PIC_MASTER, al
>> > &#4294967295; &#4294967295;cmp &#4294967295; &#4294967295; ebx, 8
>> > &#4294967295; &#4294967295;jb &#4294967295; &#4294967295; &#4294967295;fin
>> > &#4294967295; &#4294967295;out &#4294967295; &#4294967295; PIC_SLAVE, al
>>
>> These 2 OUT's are what takes 2000+ cycles. The reason you see it elsewhere is
>> most likely a bug in your measurement code.
>
>Why would it take 2000+ cycles? interrupt controller is integrated in
>a chipset which provides pretty fast access.

That is irrelevant.  Cyril is quite correct.  I/O port accesses take about
a microsecond, regardless of bus speed or where the port is located.  Part
of this comes from the need to maintain historical compatibility.
-- 
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

In message
<954433d5-a1fa-466e-bc47-4764b361952d@2g2000hsn.googlegroups.com>,
Stargazer <spamtrap@crayne.org> wrote:

> On Mar 17, 8:44&nbsp;pm, Cyril Novikov  <spamt...@crayne.org> wrote:
>> Stargazer wrote:
>> > ; Acknowledge interupt with PIC(s).
>> > ; After calling callbacks, to enable nesting of interrupt handlers
>> > according to priorities
>> > ; TODO: check rotation priorities of 8259 and other options
>> > ;
>> > mov &nbsp; &nbsp; al, CMD_EOI
>> > out &nbsp; &nbsp; PIC_MASTER, al
>> > cmp &nbsp; &nbsp; ebx, 8
>> > jb &nbsp; &nbsp; &nbsp;fin
>> > out &nbsp; &nbsp; PIC_SLAVE, al
>>
>> These 2 OUT's are what takes 2000+ cycles. The reason you see it
>> elsewhere is most likely a bug in your measurement code.
> 
> Why would it take 2000+ cycles? interrupt controller is integrated in
> a chipset which provides pretty fast access.

8259 is a legacy interupt controller interface, APIC is the modern standard.

> Anyway, your assertion is 
> not true: if I just comment out the "inc dword [_running_irq]", I get
> total latency of about 300 cycles.

But modern processors execute instructions out of the order specified in the
program! It is quite possible that the out instructions have not completed
execution when the second rdtsc executes, unless the "inc dword
[_running_irq]" is there.

If this is on a multi-processor machine cacheline ping-pong is a possible
cause.

Stargazer wrote:
> On Mar 17, 10:06 pm, "Marven Lee"  <spamt...@crayne.org> wrote:
>> Stargazer wrote:
>>> I measured CPU clocks elapsed between the first assembly instruction
>>> executed at interrupt's entry point in IDT and beginning of the C code
>>> of user-defined interrupt handler and the result was a big
>>> surprise :-) It took about 2500 cycles despite that I have only a
>>> handful of assembly instructions before a call to user-supplied IRQ
>>> handler.
>> Is the Celeron 2.8G related to the Pentium 4?  I'm not clued up on
>> processors of the last few years.
> 
> Yes, it's Celeron from Pentium-4 family. However, I don't include
> processor's interrupt entry and IRET into timing. For now I want my
> interrupt entry code to introduce minimal additional latency, not deal
> with matters that I can't change anyway :-)
> 
> As I said, I know where the problem lies, but I don't understand the
> reason so I can't yet eliminate it.
> 
> 
> D
> 
Does the same instruction take a different length of time if accessing 
another variable? Likewise, does the same instruction on the same 
variable take as long if done whilst NOT in an interrupt routine?

Is the variable somewhere odd, or is there a second instance of it, so 
that you are not accessing what you think you are? Have you viewed the 
actual machine code to see that nothing odd has been added by your 
assembly phase?

Just clutching at straws here.

Matt

On Mar 17, 10:06&#4294967295;pm, "Marven Lee"  <spamt...@crayne.org> wrote:
> Stargazer wrote:
> > I measured CPU clocks elapsed between the first assembly instruction
> > executed at interrupt's entry point in IDT and beginning of the C code
> > of user-defined interrupt handler and the result was a big
> > surprise :-) It took about 2500 cycles despite that I have only a
> > handful of assembly instructions before a call to user-supplied IRQ
> > handler.
>
> Is the Celeron 2.8G related to the Pentium 4? &#4294967295;I'm not clued up on
> processors of the last few years.

Yes, it's Celeron from Pentium-4 family. However, I don't include
processor's interrupt entry and IRET into timing. For now I want my
interrupt entry code to introduce minimal additional latency, not deal
with matters that I can't change anyway :-)

As I said, I know where the problem lies, but I don't understand the
reason so I can't yet eliminate it.

D

On Mar 17, 8:44&#4294967295;pm, Cyril Novikov  <spamt...@crayne.org> wrote:
> Stargazer wrote:
> > &#4294967295; &#4294967295;; Acknowledge interupt with PIC(s).
> > &#4294967295; &#4294967295;; After calling callbacks, to enable nesting of interrupt handlers
> > according to priorities
> > &#4294967295; &#4294967295;; TODO: check rotation priorities of 8259 and other options
> > &#4294967295; &#4294967295;;
> > &#4294967295; &#4294967295;mov &#4294967295; &#4294967295; al, CMD_EOI
> > &#4294967295; &#4294967295;out &#4294967295; &#4294967295; PIC_MASTER, al
> > &#4294967295; &#4294967295;cmp &#4294967295; &#4294967295; ebx, 8
> > &#4294967295; &#4294967295;jb &#4294967295; &#4294967295; &#4294967295;fin
> > &#4294967295; &#4294967295;out &#4294967295; &#4294967295; PIC_SLAVE, al
>
> These 2 OUT's are what takes 2000+ cycles. The reason you see it elsewhere is
> most likely a bug in your measurement code.

Why would it take 2000+ cycles? interrupt controller is integrated in
a chipset which provides pretty fast access. Anyway, your assertion is
not true: if I just comment out the "inc dword [_running_irq]", I get
total latency of about 300 cycles.


D

On Mar 17, 3:08&#4294967295;pm, (Patrick Klos) <spamt...@crayne.org> wrote:
> In article <7366de24-bc1c-47f1-9fc1-6de7d66d6...@s37g2000prg.googlegroups.com>,
> Stargazer &#4294967295; <spamt...@crayne.org> wrote:
>
> >Hi,
>
> >I am writing my own real-time kernel for x86. Now I face something
> >really strange (or may be rather it's not; it has been some time since
> >I was in the details of x86 microarchitecture).
>
> >I measured CPU clocks elapsed between the first assembly instruction
> >executed at interrupt's entry point in IDT and beginning of the C code
> >of user-defined interrupt handler and the result was a big
> >surprise :-) It took about 2500 cycles despite that I have only a
> >handful of assembly instructions before a call to user-supplied IRQ
> >handler.
>
> >A little more testing showed that almost all cycles (2300+) were spent
> >at access to a global variable (via ds:[] addressing).
>
> Is this a SINGLE memory access or several/many accesses?

It's a single read-modify-write instruction (see "inc [xxx]" in the
posted code). I don't know how would you define it, just hope that
there isn't some fancy incrementer in memory controller :-) Normal
single memory accesses which I also do (see "bt" and 2 "mov"s after
"rdtsc") don't cause anything unexpected.


D