EmbeddedRelated.com
Forums
The 2024 Embedded Online Conference

timestamp in ms and 64-bit counter

Started by pozz February 6, 2020
>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>
> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
Ah yes. Early versions of Windows NT. Crashed if they had an uptime of 49 and a bit days - I wonder why :-) Mind you getting early NT to stay up that long without crashing or needing a reboot was bloody difficult.
On Fri, 7 Feb 2020 22:03:59 -0000 (UTC), Jim Jackson
<jj@franjam.org.uk> wrote:

>>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>>
>> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
>
> Ah yes. Early versions of Windows NT. Crashed if they had an uptime
> of 49 and a bit days - I wonder why :-)
>
> Mind you getting early NT to stay up that long without crashing or needing
> a reboot was bloody difficult.
Which NT version was that? My NT 3.51 very seldom needed reboots. In many years I rebooted it only three times a year, at Easter, Christmas and the summer vacation, since I did not want to leave the computer unattended for a week or more at a time.
On 2020-02-08, upsidedown@downunder.com <upsidedown@downunder.com> wrote:
> On Fri, 7 Feb 2020 22:03:59 -0000 (UTC), Jim Jackson
> <jj@franjam.org.uk> wrote:
>
>>>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>>>
>>> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
>>
>> Ah yes. Early versions of Windows NT. Crashed if they had an uptime
>> of 49 and a bit days - I wonder why :-)
>>
>> Mind you getting early NT to stay up that long without crashing or needing
>> a reboot was bloody difficult.
>
> Which NT version was that ?
Sorry, I wasn't part of the MS team, but it was a very early version, and it was a long time ago. I remember someone did the calculation above, and it was reported to MS support. I think there was no feedback other than a set of patches later on.

The Windows NT file servers also went on a temporary go-slow periodically, which had our MS team baffled for a while. Eventually someone twigged that the sluggish performance went away when you went to the graphical console and interrupted the screensaver - yes, the screensaver was hogging the CPU and hurting fileserver performance! Needless to say, the Unix team sniggered at that.
> My NT 3.51 very seldom needed reboots. In many years I rebooted it only
> three times a year, at Easter, Christmas and the summer vacation, since I
> did not want to leave the computer unattended for a week or more at a time.
On 07/02/2020 23:03, Jim Jackson wrote:
>>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>>
>> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
>
> Ah yes. Early versions of Windows NT. Crashed if they had an uptime
> of 49 and a bit days - I wonder why :-)
>
> Mind you getting early NT to stay up that long without crashing or needing
> a reboot was bloody difficult.
I believe it was Windows 95 that had this problem - and it was not discovered until about 2005, because no one had kept Windows 95 running for 49 days. Maybe early NT had a similar fault, of course. But people /did/ have NT running for long uptimes from very early on, so such a bug would have been found fairly quickly.
On 07/02/2020 16:49, Bernd Linsel wrote:
> David Brown wrote:
>> On 07/02/2020 09:27, pozz wrote:
>>
>> That is a useful thought. It is very important to write code in a way
>> that it can be tested. And even then, remember that testing can only
>> prove the /presence/ of bugs, never prove their /absence/.
>>
>> Another trick during testing is to speed up the timers. If you can make
>> the 1 kHz timer run at 1 MHz for testing, you'll get similar benefits.
>
> The Linux approach is still better:
> Initialize the timer count variable with a value just some seconds
> before it wraps.
That's another good approach, yes.
On 07/02/2020 16:49, Bernd Linsel wrote:
> David Brown wrote:
>> On 07/02/2020 09:27, pozz wrote:
>>
>> That is a useful thought. It is very important to write code in a way
>> that it can be tested. And even then, remember that testing can only
>> prove the /presence/ of bugs, never prove their /absence/.
>>
>> Another trick during testing is to speed up the timers. If you can make
>> the 1 kHz timer run at 1 MHz for testing, you'll get similar benefits.
>
> The Linux approach is still better:
> Initialize the timer count variable with a value just some seconds
> before it wraps.
This helps if the bug is deterministic. If it isn't, and it doesn't show up at the first wrap-around after startup, you wait another 49 days for the next chance to see it.
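The init-near-wrap trick Bernd describes can be sketched in a few lines. The names (tick_init, tick_isr, tick_now) and the 5-second margin are made up for illustration; the point is only that a test build starts the counter just short of 0xffffffff so the rollover path runs seconds after boot instead of 49.7 days later:

```c
#include <stdint.h>

/* Millisecond tick counter, incremented from the 1 kHz timer ISR.
 * Names and the 5-second margin are hypothetical. */
#define TEST_WRAP_MARGIN_MS 5000u

static volatile uint32_t tick_ms;

/* In a test build, start a few seconds short of the 32-bit wrap so
 * the wrap-around code paths are exercised right after boot. */
void tick_init(int test_build)
{
    tick_ms = test_build ? (uint32_t)(0u - TEST_WRAP_MARGIN_MS) : 0u;
}

/* Called from the 1 kHz timer interrupt. */
void tick_isr(void)
{
    tick_ms++;                  /* wraps 0xffff_ffff -> 0 */
}

uint32_t tick_now(void)
{
    return tick_ms;
}
```

As pozz notes, this only helps against deterministic wrap-around bugs, but it costs almost nothing to leave in a test configuration.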
On Sat, 8 Feb 2020 16:24:30 +0100, David Brown
<david.brown@hesbynett.no> wrote:

> On 07/02/2020 23:03, Jim Jackson wrote:
>>>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>>>
>>> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
>>
>> Ah yes. Early versions of Windows NT. Crashed if they had an uptime
>> of 49 and a bit days - I wonder why :-)
>>
>> Mind you getting early NT to stay up that long without crashing or needing
>> a reboot was bloody difficult.
>
> I believe it was Windows 95 that had this problem - and it was not
> discovered until about 2005, because no one had kept Windows 95 running
> for 49 days.
That is a more believable explanation.
> Maybe early NT had a similar fault, of course. But people /did/ have NT
> running for long uptimes from very early on, so such a bug would have
> been found fairly quickly.
Both VAX/VMS and Windows NT use 100 ns as the basic unit for time-of-day timing. On Windows NT the interrupt rate was 100 Hz on a single processor and 64 Hz on a multiprocessor. Some earlier Windows versions used a 55 Hz (or was it 55 ms?) clock interrupt rate, so I really don't understand where the 1 ms clock tick or the 49 days comes from.
In article <r1h1lj$i5f$1@dont-email.me>, pozz  <pozzugno@gmail.com> wrote:
> I need a timestamp in milliseconds in the Linux epoch. It is a number
> that doesn't fit in a 32-bit number.
>
> I'm using a 32-bit MCU (STM32L4R9...) so I don't have a 64-bit hw
> counter. I need to create a mixed sw/hw 64-bit counter. It's very
> simple: I configure a 32-bit hw timer to run at 1 kHz and increment a
> uint32_t variable in the timer overflow ISR.
>
> Now I need to implement a GetTick() function that returns a uint64_t. I
> know it could be difficult, because of race conditions. One solution is
> to disable interrupts, but I remember another solution.
This is actually a very tricky problem. I believe it is not possible to solve it with the constraints you have laid out above.

David Brown's solution in his GetTick() function is correct, but it doesn't discuss why. If you have a valid 64-bit counter which you can only reference 32 bits at a time (which I'll make functions, read_high32() and read_low32(), but these can be hardware registers, volatile globals, or real functions), then an algorithm to read it reliably is basically your original algorithm:

    uint64_t GetTick(void) {
        uint32_t old_high32 = read_high32();
        while (1) {
            uint32_t low32 = read_low32();
            uint32_t new_high32 = read_high32();
            if (new_high32 == old_high32) {
                return ((uint64_t)new_high32 << 32) | low32;
            }
            old_high32 = new_high32;
        }
    }

This code does not need to mask interrupts, and it works on multiple CPUs. It works even if interrupts occur at any point for any duration, even if the code is interrupted for more than 49 days.

However, you don't have a valid 64-bit counter you can only read 32 bits at a time. You have a free-running hardware counter which read_low32() returns. It counts up every 1 ms, and eventually wraps from 0xffff_ffff to 0x0000_0000 and causes an interrupt (which lots of people have helpfully calculated happens about every 49 days). Let's assume that interrupt calls this handler:

    volatile uint32_t ticks_high = 0;

    void timer_wrap_interrupt(void) {
        ticks_high++;
    }

where by convention only this code will write to ticks_high (this is a very important limitation). And so my function read_high32() is simply { return ticks_high; }.

Unfortunately, with this design, I believe it is not possible to implement a GetTick() function which does not sometimes fail to return a correct time. There is a fundamental race between the interrupt and the timer value rolling to 0 which software cannot account for. The problem is that it's possible for software to read the HW counter and see it has rolled over from 0xffff_ffff to 0 BEFORE the interrupt occurs which increments ticks_high.

This is an inherent race: the timer wraps to 0 and signals an interrupt. It's possible, even if only for a few cycles, to read the register and see the zero before the interrupt is taken. Shown more explicitly, the following are all valid states (let's assume ticks_high is 0, and read_low32() just ticked to 0xffff_fffe):

    Time        read_low32()   ticks_high
    -------------------------------------
    0           0xffff_fffe    0
    1ms         0xffff_ffff    0
    1.99999ms   0xffff_ffff    0
    2ms         0x0000_0000    0    <- interrupt is sent and is now pending
    2ms+delta   0x0000_0000    1

The issue is: what is "delta", and can other code (including your GetTick() function) run between "2ms" and "2ms+delta"? The answer is almost assuredly "yes". This is a problem. The GetTick() routine above can read ticks_high==0, read_low32()==0, and then ticks_high==0 again at around time 2ms+small_amount, and return 0, even though a cycle or two ago read_low32() returned 0xffff_ffff. So time appears to jump backwards 49 days when this happens.

There are a variety of solutions to this problem, but they all involve extra work and ignoring the 32-bit rollover interrupt. So, remove timer_wrap_interrupt(), and then do one of:

1) Have a single GetTick() routine, which is single-tasking (by disabling interrupts, or a mutex if there are multiple processors). This requires something to call GetTick() at least once every 49 days (worst case). This is basically the Rich C./David Brown solution, but they don't mention that you need to remove the interrupt on 32-bit overflow.

2) Use a higher interrupt rate. For instance, if we can take the interrupt when read_low32() has a carry from bit 28 to bit 29, then we can piece together code which can work as long as GetTick() isn't delayed by more than 3-4 days. This requires GetTick() to change, using the code given under #4 below.

3) Forget the hardware counter: just take an interrupt every 1 ms, increment a global variable uint64_t ticks64 on each interrupt, and have GetTick() simply return ticks64. This only works if the CPU hardware supports atomic 64-bit accesses. It's not generally possible to write C code for a 32-bit processor which can guarantee 64-bit atomic ops, so it's best to have the interrupt handler deal with two 32-bit variables ticks_low and ticks_high, and then you still need GetTick() to have a while loop to read the two variables consistently.

4) Use a regular existing interrupt which occurs at any rate, as long as it's well over 1 ms and well under 49 days. Let's assume you have a 1-second interrupt. This can be asynchronous to the 1 ms timer. In that interrupt handler, you sample the 32-bit hardware counter, and if you notice it wrapping (previous read value > new value), increment ticks_high. You need to update the global volatile variable ticks_low as well with the current hw count. And this interrupt handler needs to be the only code changing ticks_low and ticks_high. Then GetTick() does the following:

    uint32_t local_ticks_low, local_ticks_high;
    /* while loop to read a consistent ticks_low/ticks_high pair
       into the local_* variables */
    uint64_t ticks64 = ((uint64_t)local_ticks_high << 32) | local_ticks_low;
    ticks64 += (int32_t)(read_low32() - local_ticks_low);
    return ticks64;

Basically, we return the ticks64 from the last regular interrupt, which could be 1 second ago, and we add in the small delta from reading the hw counter. Again, this requires the 1-second interrupt to be guaranteed to happen before we get close to 49 days since the last 1-second interrupt (if it's really a 1-second interrupt, it easily meets that criterion; if you try to pick something irregular, like a keypress interrupt, that won't work). It does not depend on the exact rate of the interrupt at all.

I wrote it above with extra safety: it subtracts two 32-bit unsigned variables, gets a 32-bit unsigned result, treats that as a 32-bit signed result, and adds that to the 64-bit unsigned ticks count. The cast to 32-bit signed int is not strictly necessary: it just makes the code more robust in case the HW timer moves backwards slightly. Imagine some code tries to adjust the current timer value by setting it backwards slightly (say, some code trying to calibrate the timer against the RTC or something). Without the cast, this slight backwards move would result in ticks64 jumping ahead 49 days, which would be bad. In C, this is pretty easy, but it should be carefully commented so no one removes any important casts.

Kent
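Pieced together, Kent's approach #4 might look like the sketch below. The hardware counter is stubbed as a plain variable so the wrap-detection logic can be exercised off-target; sample_isr() is a hypothetical name standing in for the periodic (e.g. 1 Hz) interrupt handler, and on real hardware the double-read loop matters because GetTick() can be preempted by that handler:

```c
#include <stdint.h>

/* Free-running 1 kHz hardware counter, stubbed as a variable here;
 * on the STM32L4 this would read the timer's count register. */
static volatile uint32_t hw_count;

static uint32_t read_low32(void)
{
    return hw_count;
}

/* Snapshot pair, written ONLY by the periodic sampling interrupt. */
static volatile uint32_t ticks_low;
static volatile uint32_t ticks_high;

/* Periodic interrupt handler (e.g. 1 Hz): sample the counter and
 * detect wrap-around.  Must run at least once per 49.7-day period
 * of the 32-bit counter. */
void sample_isr(void)
{
    uint32_t now = read_low32();
    if (now < ticks_low)            /* counter wrapped since last sample */
        ticks_high++;
    ticks_low = now;
}

uint64_t GetTick(void)
{
    uint32_t lo, hi;
    do {                            /* consistent snapshot of the pair */
        hi = ticks_high;
        lo = ticks_low;
    } while (hi != ticks_high);     /* retry if sample_isr() ran */

    uint64_t t = ((uint64_t)hi << 32) | lo;
    /* Add the time elapsed since the last sample.  The signed cast is
     * Kent's extra safety: a slight backwards nudge of the hardware
     * counter subtracts a little instead of jumping ahead 49 days. */
    t += (int32_t)(read_low32() - lo);
    return t;
}
```

Note that, per Kent's description, the 32-bit rollover interrupt is not used at all here; the sampling interrupt is the only writer of the snapshot pair.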
On 2/8/20 12:03 PM, Kent Dickey wrote:
> Shown more explicitly, the following are all valid states (let's assume
> ticks_high is 0, read_low32() just ticked to 0xffff_fffe):
>
>     Time        read_low32()   ticks_high
>     -------------------------------------
>     0           0xffff_fffe    0
>     1ms         0xffff_ffff    0
>     1.99999ms   0xffff_ffff    0
>     2ms         0x0000_0000    0    <- interrupt is sent and is now pending
>     2ms+delta   0x0000_0000    1
>
> The issue is: what is "delta", and can other code (including your GetTick()
> function) run between "2ms" and "2ms+delta"? And the answer is almost
> assuredly "yes". This is a problem.
But, as long as the timing is such that we cannot do BOTH the read_low32() and the read of ticks_high within that delta, we can't get the wrong number. This is somewhat a function of the processor, and of how far the instruction pipeline 'skids' when an interrupt occurs. The processor he mentioned, an STM32L4R9, uses a Cortex-M4 core, which doesn't have that much skid, so this can't be a problem unless you do something foolish like disabling interrupts while doing the sequence.

If we put a proper barrier instruction between the low read and the second high read (and we may need that anyway, just to avoid getting a cached value from the first read), and declare the variable volatile so the compiler doesn't do its own caching, then the problem doesn't occur. Again, not a problem on his processor, as it is a single-core part (but we still need the volatile).
On Saturday, February 8, 2020 at 10:24:33 AM UTC-5, David Brown wrote:
> On 07/02/2020 23:03, Jim Jackson wrote:
>>>> A 32 bit counter incremented at 1 kHz will roll over every 3 years or so.
>>>
>>> 2 ^ 32 / (24 * 60 * 60 * 1000) = 49.71026962... (days)
>>
>> Ah yes. Early versions of Windows NT. Crashed if they had an uptime
>> of 49 and a bit days - I wonder why :-)
>>
>> Mind you getting early NT to stay up that long without crashing or needing
>> a reboot was bloody difficult.
>
> I believe it was Windows 95 that had this problem - and it was not
> discovered until about 2005, because no one had kept Windows 95 running
> for 49 days.
49 days? You mean 49 minutes?
> Maybe early NT had a similar fault, of course. But people /did/ have NT
> running for long uptimes from very early on, so such a bug would have
> been found fairly quickly.
Never used NT, but I used W2k and it was great! W2k was widely pirated, so MS started a phone-home type of licensing with XP, which was initially not well received but over time became accepted. Now people reminisce about the halcyon days of XP.

Networking under W2k required a lot of manual setting up, but it was not hard to do. A web site, World of Windows Networking, made it easy until it was bought and ruined with advertising and low-quality content. Now I have trouble just getting two Win10 computers to share a file directory.

--
Rick C.
+- Get 1,000 miles of free Supercharging
+- Tesla referral code - https://ts.la/richard11209
