EmbeddedRelated.com
Forums
The 2024 Embedded Online Conference

timestamp in ms and 64-bit counter

Started by pozz February 6, 2020
>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>
> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
Ah yes. Early versions of Windows NT. Crashed if they had an uptime of 49 and a bit days - I wonder why :-) Mind you getting early NT to stay up that long without crashing or needing a reboot was bloody difficult.
On Fri, 7 Feb 2020 22:03:59 -0000 (UTC), Jim Jackson
<jj@franjam.org.uk> wrote:

>>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>>
>> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
>
> Ah yes. Early versions of Windows NT. Crashed if they had an uptime
> of 49 and a bit days - I wonder why :-)
>
> Mind you getting early NT to stay up that long without crashing or needing
> a reboot was bloody difficult.
Which NT version was that? My NT 3.51 very seldom needed reboots. In many years I rebooted it only three times a year, at Easter, Christmas and the summer vacation, since I did not want to leave the computer unattended for a week or more at a time.
On 2020-02-08, upsidedown@downunder.com <upsidedown@downunder.com> wrote:
> On Fri, 7 Feb 2020 22:03:59 -0000 (UTC), Jim Jackson
> <jj@franjam.org.uk> wrote:
>
>>>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>>>
>>> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
>>
>> Ah yes. Early versions of Windows NT. Crashed if they had an uptime
>> of 49 and a bit days - I wonder why :-)
>>
>> Mind you getting early NT to stay up that long without crashing or needing
>> a reboot was bloody difficult.
>
> Which NT version was that ?
Sorry, I wasn't part of the MS team, but it was a very early version, and it was a long time ago. I remember someone did the calculation above, and it was reported to MS support. I think there was no feedback other than a set of patches later on.

The Windows NT file servers also went on a temporary go-slow periodically, which had our MS team baffled for a while. Eventually someone twigged that the sluggish performance went away when you went to the graphical console and interrupted the screensaver - yes, the screensaver was hogging the CPU and hurting fileserver performance! Needless to say, the Unix team sniggered at that.
> My NT 3.51 very seldom needed reboots. In many years I rebooted it only
> three times a year, at Easter, Christmas and the summer vacation, since I
> did not want to leave the computer unattended for a week or more at a time.
On 07/02/2020 23:03, Jim Jackson wrote:
>>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>>
>> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
>
> Ah yes. Early versions of Windows NT. Crashed if they had an uptime
> of 49 and a bit days - I wonder why :-)
>
> Mind you getting early NT to stay up that long without crashing or needing
> a reboot was bloody difficult.
I believe it was Windows 95 that had this problem - and it was not discovered until about 2005, because no one had kept Windows 95 running for 49 days. Maybe early NT had a similar fault, of course. But people /did/ have NT running for long uptimes from very early on, so such a bug would have been found fairly quickly.
On 07/02/2020 16:49, Bernd Linsel wrote:
> David Brown wrote:
>> On 07/02/2020 09:27, pozz wrote:
>>
>> That is a useful thought. It is very important to write code in a way
>> that it can be tested. And even then, remember that testing can only
>> prove the /presence/ of bugs, never prove their /absence/.
>>
>> Another trick during testing is to speed up the timers. If you can make
>> the 1 kHz timer run at 1 MHz for testing, you'll get similar benefits.
>
> The Linux approach is still better:
> Initialize the timer count variable with a value just some seconds
> before it wraps.
That's another good approach, yes.
On 07/02/2020 16:49, Bernd Linsel wrote:
> David Brown wrote:
>> On 07/02/2020 09:27, pozz wrote:
>>
>> That is a useful thought. It is very important to write code in a way
>> that it can be tested. And even then, remember that testing can only
>> prove the /presence/ of bugs, never prove their /absence/.
>>
>> Another trick during testing is to speed up the timers. If you can make
>> the 1 kHz timer run at 1 MHz for testing, you'll get similar benefits.
>
> The Linux approach is still better:
> Initialize the timer count variable with a value just some seconds
> before it wraps.
This helps if the bug is deterministic. If it isn't, and it doesn't show up at the first wrap-around after startup, you wait another 49 days for the next chance to see it.
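The init-near-wrap trick Bernd describes can be sketched in a few lines. The names (tick_init, tick_isr, tick_now) and the 5-second margin are made up for illustration; the point is only that a test build starts the counter just short of 0xffffffff so the rollover path runs seconds after boot instead of 49.7 days later:

```c
#include <stdint.h>

/* Millisecond tick counter, incremented from the 1 kHz timer ISR.
 * Names and the 5-second margin are hypothetical. */
#define TEST_WRAP_MARGIN_MS 5000u

static volatile uint32_t tick_ms;

/* In a test build, start a few seconds short of the 32-bit wrap so
 * the wrap-around code paths are exercised right after boot. */
void tick_init(int test_build)
{
    tick_ms = test_build ? (uint32_t)(0u - TEST_WRAP_MARGIN_MS) : 0u;
}

/* Called from the 1 kHz timer interrupt. */
void tick_isr(void)
{
    tick_ms++;                  /* wraps 0xffff_ffff -> 0 */
}

uint32_t tick_now(void)
{
    return tick_ms;
}
```

As pozz notes, this only helps against deterministic wrap-around bugs, but it costs almost nothing to leave in a test configuration.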
On Sat, 8 Feb 2020 16:24:30 +0100, David Brown
<david.brown@hesbynett.no> wrote:

> On 07/02/2020 23:03, Jim Jackson wrote:
>>>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>>>
>>> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
>>
>> Ah yes. Early versions of Windows NT. Crashed if they had an uptime
>> of 49 and a bit days - I wonder why :-)
>>
>> Mind you getting early NT to stay up that long without crashing or needing
>> a reboot was bloody difficult.
>
> I believe it was Windows 95 that had this problem - and it was not
> discovered until about 2005, because no one had kept Windows 95 running
> for 49 days.
That is a more believable explanation.
> Maybe early NT had a similar fault, of course. But people /did/ have NT
> running for long uptimes from very early on, so such a bug would have
> been found fairly quickly.
Both VAX/VMS and Windows NT use 100 ns as the basic unit for time-of-day timing. On Windows NT the interrupt rate was 100 Hz on a single processor and 64 Hz on a multiprocessor. Some earlier Windows versions used a 55 Hz (or was it 55 ms?) clock interrupt rate, so I really don't understand where the 1 ms clock tick or the 49 days comes from.
In article <r1h1lj$i5f$1@dont-email.me>, pozz  <pozzugno@gmail.com> wrote:
> I need a timestamp in milliseconds in the Linux epoch. It is a number
> that doesn't fit in a 32-bit number.
>
> I'm using a 32-bit MCU (STM32L4R9...) so I don't have a 64-bit hw
> counter. I need to create a mixed sw/hw 64-bit counter. It's very
> simple: I configure a 32-bit hw timer to run at 1 kHz and increment a
> uint32_t variable in the timer overflow ISR.
>
> Now I need to implement a GetTick() function that returns a uint64_t. I
> know it could be difficult, because of race conditions. One solution is
> to disable interrupts, but I remember another solution.
This is actually a very tricky problem. I believe it is not possible to solve it with the constraints you have laid out above.

David Brown's solution in his GetTick() function is correct, but it doesn't discuss why. If you have a valid 64-bit counter which you can only reference 32 bits at a time (which I'll make functions, read_high32() and read_low32(), but these can be hardware registers, volatile globals, or real functions), then an algorithm to read it reliably is basically your original algorithm:

    uint64_t GetTick(void) {
        uint32_t old_high32 = read_high32();
        while (1) {
            uint32_t low32 = read_low32();
            uint32_t new_high32 = read_high32();
            if (new_high32 == old_high32) {
                return ((uint64_t)new_high32 << 32) | low32;
            }
            old_high32 = new_high32;
        }
    }

This code does not need to mask interrupts, and it works on multiple CPUs. It works even if interrupts occur at any point for any duration, even if the code is interrupted for more than 49 days.

However, you don't have a valid 64-bit counter you can only read 32 bits at a time. You have a free-running hardware counter which read_low32() returns. It counts up every 1 ms, and eventually wraps from 0xffff_ffff to 0x0000_0000 and causes an interrupt (which lots of people have helpfully calculated happens about every 49 days). Let's assume that interrupt calls this handler:

    volatile uint32_t ticks_high = 0;

    void timer_wrap_interrupt(void) {
        ticks_high++;
    }

where by convention only this code will write to ticks_high (this is a very important limitation). And so my function read_high32() is simply { return ticks_high; }.

Unfortunately, with this design, I believe it is not possible to implement a GetTick() function which does not sometimes fail to return a correct time. There is a fundamental race between the interrupt and the timer value rolling to 0 which software cannot account for. The problem is that it's possible for software to read the HW counter and see it has rolled over from 0xffff_ffff to 0 BEFORE the interrupt occurs which increments ticks_high.

This is an inherent race: the timer wraps to 0 and signals an interrupt. It's possible, even if only for a few cycles, to read the register and see the zero before the interrupt is taken. Shown more explicitly, the following are all valid states (let's assume ticks_high is 0, and read_low32() just ticked to 0xffff_fffe):

    Time        read_low32()   ticks_high
    -------------------------------------
    0           0xffff_fffe    0
    1ms         0xffff_ffff    0
    1.99999ms   0xffff_ffff    0
    2ms         0x0000_0000    0    <- interrupt is sent and is now pending
    2ms+delta   0x0000_0000    1

The issue is: what is "delta", and can other code (including your GetTick() function) run between "2ms" and "2ms+delta"? The answer is almost assuredly "yes". This is a problem. The GetTick() routine above can read ticks_high==0, read_low32()==0, and then ticks_high==0 again at around time 2ms+small_amount, and return 0, even though a cycle or two ago read_low32() returned 0xffff_ffff. So time appears to jump backwards 49 days when this happens.

There are a variety of solutions to this problem, but they all involve extra work and ignoring the 32-bit rollover interrupt. So, remove timer_wrap_interrupt(), and then do one of:

1) Have a single GetTick() routine, which is single-tasking (by disabling interrupts, or a mutex if there are multiple processors). This requires something to call GetTick() at least once every 49 days (worst case). This is basically the Rich C./David Brown solution, but they don't mention that you need to remove the interrupt on 32-bit overflow.

2) Use a higher interrupt rate. For instance, if we can take the interrupt when read_low32() has a carry from bit 28 to bit 29, then we can piece together code which can work as long as GetTick() isn't delayed by more than 3-4 days. This requires GetTick() to change, using the code given under #4 below.

3) Forget the hardware counter: just take an interrupt every 1 ms, increment a global variable uint64_t ticks64 on each interrupt, and have GetTick() simply return ticks64. This only works if the CPU hardware supports atomic 64-bit accesses. It's not generally possible to write C code for a 32-bit processor which can guarantee 64-bit atomic ops, so it's best to have the interrupt handler deal with two 32-bit variables ticks_low and ticks_high, and then you still need GetTick() to have a while loop to read the two variables consistently.

4) Use a regular existing interrupt which occurs at any rate, as long as it's well over 1 ms and well under 49 days. Let's assume you have a 1-second interrupt. This can be asynchronous to the 1 ms timer. In that interrupt handler, you sample the 32-bit hardware counter, and if you notice it wrapping (previous read value > new value), increment ticks_high. You need to update the global volatile variable ticks_low as well with the current hw count. And this interrupt handler needs to be the only code changing ticks_low and ticks_high. Then GetTick() does the following:

    uint32_t local_ticks_low, local_ticks_high;
    /* while loop to read a consistent ticks_low/ticks_high pair
       into the local_* variables */
    uint64_t ticks64 = ((uint64_t)local_ticks_high << 32) | local_ticks_low;
    ticks64 += (int32_t)(read_low32() - local_ticks_low);
    return ticks64;

Basically, we return the ticks64 from the last regular interrupt, which could be 1 second ago, and we add in the small delta from reading the hw counter. Again, this requires the 1-second interrupt to be guaranteed to happen before we get close to 49 days since the last 1-second interrupt (if it's really a 1-second interrupt, it easily meets that criterion; if you try to pick something irregular, like a keypress interrupt, that won't work). It does not depend on the exact rate of the interrupt at all.

I wrote it above with extra safety: it subtracts two 32-bit unsigned variables, gets a 32-bit unsigned result, treats that as a 32-bit signed result, and adds that to the 64-bit unsigned ticks count. The cast to 32-bit signed int is not strictly necessary: it just makes the code more robust in case the HW timer moves backwards slightly. Imagine some code tries to adjust the current timer value by setting it backwards slightly (say, some code trying to calibrate the timer against the RTC or something). Without the cast, this slight backwards move would result in ticks64 jumping ahead 49 days, which would be bad. In C, this is pretty easy, but it should be carefully commented so no one removes any important casts.

Kent
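Pieced together, Kent's approach #4 might look like the sketch below. The hardware counter is stubbed as a plain variable so the wrap-detection logic can be exercised off-target; sample_isr() is a hypothetical name standing in for the periodic (e.g. 1 Hz) interrupt handler, and on real hardware the double-read loop matters because GetTick() can be preempted by that handler:

```c
#include <stdint.h>

/* Free-running 1 kHz hardware counter, stubbed as a variable here;
 * on the STM32L4 this would read the timer's count register. */
static volatile uint32_t hw_count;

static uint32_t read_low32(void)
{
    return hw_count;
}

/* Snapshot pair, written ONLY by the periodic sampling interrupt. */
static volatile uint32_t ticks_low;
static volatile uint32_t ticks_high;

/* Periodic interrupt handler (e.g. 1 Hz): sample the counter and
 * detect wrap-around.  Must run at least once per 49.7-day period
 * of the 32-bit counter. */
void sample_isr(void)
{
    uint32_t now = read_low32();
    if (now < ticks_low)            /* counter wrapped since last sample */
        ticks_high++;
    ticks_low = now;
}

uint64_t GetTick(void)
{
    uint32_t lo, hi;
    do {                            /* consistent snapshot of the pair */
        hi = ticks_high;
        lo = ticks_low;
    } while (hi != ticks_high);     /* retry if sample_isr() ran */

    uint64_t t = ((uint64_t)hi << 32) | lo;
    /* Add the time elapsed since the last sample.  The signed cast is
     * Kent's extra safety: a slight backwards nudge of the hardware
     * counter subtracts a little instead of jumping ahead 49 days. */
    t += (int32_t)(read_low32() - lo);
    return t;
}
```

Note that, per Kent's description, the 32-bit rollover interrupt is not used at all here; the sampling interrupt is the only writer of the snapshot pair.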
On 2/8/20 12:03 PM, Kent Dickey wrote:
> Shown more explicitly, the following are all valid states (let's assume
> ticks_high is 0, read_low32() just ticked to 0xffff_fffe):
>
>     Time        read_low32()   ticks_high
>     -------------------------------------
>     0           0xffff_fffe    0
>     1ms         0xffff_ffff    0
>     1.99999ms   0xffff_ffff    0
>     2ms         0x0000_0000    0    <- interrupt is sent and is now pending
>     2ms+delta   0x0000_0000    1
>
> The issue is: what is "delta", and can other code (including your GetTick()
> function) run between "2ms" and "2ms+delta"? And the answer is almost
> assuredly "yes". This is a problem.
But, as long as the timing is such that we cannot do BOTH the read_low32() and the read of ticks_high within that delta, we can't get the wrong number. This is somewhat a function of the processor, and of how far the instruction pipeline 'skids' when an interrupt occurs. The processor he mentioned, an STM32L4R9, uses a Cortex-M4 core, which doesn't have that much skid, so this can't be a problem unless you do something foolish like disabling interrupts while doing the sequence.

If we put a proper barrier instruction between the low read and the second high read (and we may need that anyway, just to avoid getting a cached value from the first read), and declare the variable volatile so the compiler doesn't do its own caching, then the problem doesn't occur. Again, not a problem on his processor, as it is a single-core part (but we still need the volatile).
On Saturday, February 8, 2020 at 10:24:33 AM UTC-5, David Brown wrote:
> On 07/02/2020 23:03, Jim Jackson wrote:
>>>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>>>
>>> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
>>
>> Ah yes. Early versions of Windows NT. Crashed if they had an uptime
>> of 49 and a bit days - I wonder why :-)
>>
>> Mind you getting early NT to stay up that long without crashing or needing
>> a reboot was bloody difficult.
>
> I believe it was Windows 95 that had this problem - and it was not
> discovered until about 2005, because no one had kept Windows 95 running
> for 49 days.
49 days? You mean 49 minutes?
> Maybe early NT had a similar fault, of course. But people /did/ have NT
> running for long uptimes from very early on, so such a bug would have
> been found fairly quickly.
Never used NT, but I used W2k and it was great! W2k was widely pirated, so MS started a phone-home type of licensing with XP, which was initially not well received but over time became accepted. Now people reminisce about the halcyon days of XP.

Networking under W2k required a lot of manual setting up, but it was not hard to do. A web site, World of Windows Networking, made it easy until it was bought and ruined with advertising and low-quality content. Now I have trouble just getting two Win10 computers to share a file directory.

--
Rick C.
+- Get 1,000 miles of free Supercharging
+- Tesla referral code - https://ts.la/richard11209
