timestamp in ms and 64-bit counter| page 7

Reply by Robert Wessel ● February 12, 2020

On Wed, 12 Feb 2020 16:44:12 -0800 (PST), Rick C
<gnuarm.deletethisbit@gmail.com> wrote:

>On Wednesday, February 12, 2020 at 4:44:13 PM UTC-5, robert...@yahoo.com wrote:

>> No, the idea is to not update the word in memory unless it hasn't been
>> changed.  The classic example is using CAS to add an item to a linked
>> list.  You read the head pointer (that has to happen atomically, but
>> on most CPUs that just requires that it be aligned), construct the new
>> first element (most crucially the next pointer), and then if the head
>> pointer is unchanged, you can replace it with a pointer to the new
>> first item.
>> 
>> If the values are not equal, you don't want to update the head pointer
>> or you'll trash the linked list.  In that case you retry the insertion
>> operation using the new head pointer.
>> 
>> CAS is intended to be safe to use to make that update, as it's atomic
>> - the read of the value in memory, the compare to the old value, and
>> the conditional update form an atomic block, and can't be interrupted
>> or messed with by other CPUs in the system.
>> 
>> CAS is pretty easy to simulate with LL/SC.  In some cases you'd be
>> better off adjusting the algorithm to better use LL/SC.  In this case
>> it depends on how you're accessing the low word of the timer.  If you
>> have only a single thread of execution, you can fake CAS by
>> disabling interrupts.
>
>Ok, this is more clear now.  Wikipedia explains LL/SC pretty well.  This is actually for multiple CPUs as much as multitasking.  While you can just disable interrupts (assuming you can live with the interrupt latency issues) to make this work with a single CPU, if you are sharing the data structure with other CPUs the bus requires locking while these multiple transactions are happening.  I assume the CPU has a signal to indicate a locked operation is happening to prevent other accesses from getting in and mucking up the works.  
>
>Is there a way to emulate this locking using semaphores?  Someone I know is a big fan of Propeller CPUs which share memory and I don't know if they have such an instruction.  They share memory by interleaving access. 


In the algorithm I suggested, you could just put a mutex around the
sequence that emulates the CAS.  That's safe, since the extension word
is never updated from inside an interrupt handler (unless you actually
intend for that to be possible, such as if you were reading the extended
time value from inside an ISR).  Even if that's slow, it's on a leg
of the code that will happen only rarely.

You still need the atomic read of the extension word (although that's
typically a non-issue, especially on a single hardware thread system).
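
A minimal sketch of the mutex-wrapped CAS emulation described above,
assuming a POSIX threads environment (on an RTOS, its own mutex type
would take the place of pthread_mutex_t).  The names emulated_cas and
cas_lock are made up for illustration and do not come from the thread.
It only works if every writer of the word goes through this function.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lock guarding every emulated CAS in this sketch. */
static pthread_mutex_t cas_lock = PTHREAD_MUTEX_INITIALIZER;

/* Emulated compare-and-swap: store `desired` into *p only if *p still
 * holds `expected`.  Returns true if the store happened. */
static bool emulated_cas(volatile uint32_t *p, uint32_t expected,
                         uint32_t desired)
{
    assert(p != NULL);
    bool swapped = false;

    pthread_mutex_lock(&cas_lock);
    if (*p == expected) {       /* compare ...                     */
        *p = desired;           /* ... and swap, under the same lock */
        swapped = true;
    }
    pthread_mutex_unlock(&cas_lock);

    return swapped;
}
```

A caller that gets false back re-reads the word and retries, just as
with a hardware CAS.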

Reply by David Brown ● February 13, 2020

On 13/02/2020 01:44, Rick C wrote:

> Ok, this is more clear now.  Wikipedia explains LL/SC pretty well.
> This is actually for multiple CPUs as much as multitasking.  While
> you can just disable interrupts (assuming you can live with the
> interrupt latency issues) to make this work with a single CPU, if you
> are sharing the data structure with other CPUs the bus requires
> locking while these multiple transactions are happening.  I assume
> the CPU has a signal to indicate a locked operation is happening to
> prevent other accesses from getting in and mucking up the works.
> 

Yes, cpus with CAS and other locked instructions (like atomic
read-modify-write sequences) need bus lock signals.  These are quite
easy to work with from the software viewpoint, and a real PITA to
implement efficiently in hardware in a multi-core system with caches.
Thus you get them in architectures like x86 that are designed to be easy
to program, but not in RISC systems that are designed for fast and
efficient implementations.

CAS can be useful even on a single cpu, if you have multiple masters
(DMA, for example).  And CAS or LL/SC can be useful on a single cpu if
you have pre-emptive multi-tasking and don't want to (or can't) disable
interrupts.

On a small processor like yours, disabling interrupts around critical
regions is almost certainly the easiest and most efficient solution.

(If I were making a cpu, I'd like to have a "temporary interrupt
disable" counter as well as a global interrupt disable flag.  I'd have
an instruction to set this counter to perhaps 3 to 7 counts.  That's
enough time to make a CAS, or an atomic read-modify-write.)

> Is there a way to emulate this locking using semaphores?  Someone I
> know is a big fan of Propeller CPUs which share memory and I don't
> know if they have such an instruction.  They share memory by
> interleaving access.
> 

There is a whole field of possibilities with locking, synchronisation
mechanisms, and lock-free algorithms.  Generally speaking, once you have
one synchronisation primitive, you can emulate any others using it - but
the efficiency can vary enormously.
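
As one illustration of building one primitive from another, here is a
minimal sketch of a spinlock built from compare-and-swap, assuming a
C11 compiler with <stdatomic.h>.  The type and function names are
hypothetical.  Spinning like this is only reasonable when the lock is
held very briefly, which is one reason the efficiency of such
emulations varies so much.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical spinlock type: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} cas_spinlock;

/* One CAS attempt: succeeds only if the lock was free (0 -> 1). */
static bool cas_spinlock_try_acquire(cas_spinlock *l)
{
    assert(l != NULL);
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->held, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Retry the CAS until it succeeds. */
static void cas_spinlock_acquire(cas_spinlock *l)
{
    while (!cas_spinlock_try_acquire(l)) {
        /* spin */
    }
}

/* Releasing is a plain atomic store of 0. */
static void cas_spinlock_release(cas_spinlock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```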

Reply by George Neuner ● February 13, 2020

On Sun, 9 Feb 2020 12:17:42 +0100, David Brown
<david.brown@hesbynett.no> wrote:

>On 09/02/2020 07:35, upsidedown@downunder.com wrote:
>> On Sat, 8 Feb 2020 19:57:48 +0100, David Brown
>> <david.brown@hesbynett.no> wrote:
>> 
>>>> Never used NT, but I used W2k and it was great!  W2k was widely
>>>> pirated so MS started a phone home type of licensing with XP which
>>>> was initially not well received, but over time became accepted.  Now
>>>> people reminisce about the halcyon days of XP.
>>>
>>> Did you not use NT 4.0 ?  It was quite solid.  W2K was also good, but XP
>>> took a few service packs before it became reliable enough for serious use.
>> 
>> NT 4.0 solid ??
>> 
>> NT4 moved graphical functions to kernel mode to speed up window
>> updates.
>
>Yes.  And that meant bugs in the graphics drivers could kill the whole 
>system, unlike in NT 3.x.  And bugs in the graphics drivers were 
>certainly not unknown.  However, with a little care it could run 
>reliably for long times.  I don't remember ever having a software or OS 
>related crash or halt on our little NT 4 server.

Ditto.  

I spent ~7 years in a small company as acting network admin in
addition to my regular development work.  I watched over a pair of NT4
servers, a dozen NT4 workstations, and a handful of Win98 machines.

The NT servers never gave any problems.  They ran 24/7 and were
rebooted only to replace a disk or install new software.  We didn't
install all the service packs, so sometimes the servers would run for
more than a year without a reboot.

The workstations only rarely had problems despite being exposed to
software that was being developed on them.  The machines ran 24/7 -
backups done after hours and on weekends.  I can speak only to my own
experience as a developer:  my workstation took a fair amount of abuse
from crashing and otherwise misbehaving software, but generally it was
rock solid and would run for months without something happening that
required a reboot to fix.

>> In general, each NT4 service pack introduced new bugs and soon the
>> next SP was released to correct the bugs introduced by the previous
>> SP. Thus only every other SP was actually usable.
>> 
>> Even NT5 beta was more stable than NT4 with most recent SP. NT5 beta
>> was renamed Windows 2000 before final release.
>> 
>
>I certainly liked W2K, and found it quite reliable.  But I still 
>remember NT 4.0 as good too.

In my experience, W2K was a bit flaky until SP2.  After that, it
generally was stable.

Poster "upsidedown" (sorry, don't know your name) was right though
about the NT4 service packs.  In my own experience:
 - the initial OS release was a bit flaky
 - SP1 was stable (at least for English speakers)
 - SP2 was really flaky
 - SP3 was stable
 - SP4 was stable
 - SP5 was a bit flaky
 - SP6 was stable

I have been using Windows since 3.0 (which still ran DOS underneath).
I was quite happy with the reliability of NT4.  I have had far more
problems with "more modern" versions: XP, Win7, and now Win10.

YMMV,
George

Reply by Bernd Linsel ● February 13, 2020

Rick C wrote:
> On Wednesday, February 12, 2020 at 4:44:13 PM UTC-5, robert...@yahoo.com wrote:
> 
> Ok, this is more clear now.  Wikipedia explains LL/SC pretty well.  This is actually for multiple CPUs as much as multitasking.  While you can just disable interrupts (assuming you can live with the interrupt latency issues) to make this work with a single CPU, if you are sharing the data structure with other CPUs the bus requires locking while these multiple transactions are happening.  I assume the CPU has a signal to indicate a locked operation is happening to prevent other accesses from getting in and mucking up the works.
> 
> Is there a way to emulate this locking using semaphores?  Someone I know is a big fan of Propeller CPUs which share memory and I don't know if they have such an instruction.  They share memory by interleaving access.
> 
> 
> Custom stack processor, related to the Forth VM.  When designing FPGAs I want a CPU with deterministic timing, so 1 instruction = 1 clock cycle works well.  Interrupt latency is zero or one depending on how you count it.  Next cycle after an unmasked interrupt is asserted fetches the first instruction of the IRQ routine.
> 
> The CPU is not pipelined but the registers are aligned through the architecture to make it decode-execute/fetch rather than fetch-decode-execute.  The fetch only depends on flags and instruction decode so it happens in parallel with the execute as far as timing is concerned.  Someone insisted this was pipelined design because of these parallel parts.
> 
> It's nothing special, YAMC (Yet Another MISC CPU).  I've never spent the time to optimize the design for speed.  Instead I did some work trying to hybridize the stack design with register-like access to the stack to minimize stack juggling.  Once that happened, the number of instructions for the test case I was using (an IRQ for DDS calculations) dropped by either a third or half, I forget which.  The big stumbling block for me is coming up with software to help write code for it.  lol
> 

One should mention that, at least in ARM and MIPS architectures, LL and 
SC are not implemented with a global lock signal, but instead using 
cache snooping (for uni- and multiprocessing systems).

LL just performs a simple load and additionally locks the (L1) data 
cache line of that address (so that it cannot be replaced until SC or 
another LL).

SC checks if data in that cache line has been modified since the last 
LL; if so, it fails, otherwise it succeeds and writes the datum (whether 
write-through or write-back depends on the CPU cache configuration and 
the virtual address).

An SC instruction targeting an address that hasn't been a LL source 
before always fails and invalidates all LL atomic flags, so that their
corresponding SC's will fail. Thus, an SC to a dummy address is 
exploited to implement synchronization barriers (in addition to cache 
sync instructions).

The possible number of concurrent LL/SC pairs depends on the CPU model: 
most support only 1 pending SC after a LL, while some allow up to 8 
parallel LL/SC pairs (from different cache lines).

Finally, an example: emulated CAS on a MIPS32 CPU, which works
independently of the number of processors in the system:

// compare_and_swap
// 	input: a0 = unsigned *p, a1 = unsigned old, a2 = unsigned new
// 	returns: v0 = 1 (success) | 0 (failure), v1 = old value from *p

	.set nomips16, nomicromips, noreorder, nomacro
compare_and_swap:
1:	ll v1, 0(a0)		// load linked from a0+0 in v1
	bne v1, a1, 9f		// if v1 != a1 (old),
				// branch forward to label 9
	move v0, zero		// branch delay slot: load result 0
				// executed "while" taking the branch

	move v0, a2		// load a copy of a2 (new) into v0
	sc v0, 0(a0)		// store conditionally into a0+0
	beq v0, zero, 1b	// if unsuccessful (v0 == 0)
				// retry at label 1
	nop			// branch delay slot: nothing to do

9:	jr ra			// else (v0 == 1) return v0
	nop			// jump delay slot: nothing to do

Note: This example could be further optimized for speed at the cost of 
program space, by reordering the opcodes so that the preferred case 
(successful CAS) executes linearly and branchless, with forward branches 
likely not taken and backward branches likely taken (usable branch 
prediction has only been introduced at MIPS R8).
L1 cache latency is usually 1 clock; when executing linear code, it is 
hidden by prefetch and pipeline.
An L1 cache line is typically 64 bytes (16 words) wide, i.e. if the CPU 
supports parallel LL/SCs, they must be at least 16 words apart, 
otherwise the SC to the address of the first LL will always fail.

Regards,
Bernd
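
For comparison with the assembly above, here is a minimal sketch in
portable C of the linked-list insertion quoted at the start of this
page, assuming a C11 compiler with <stdatomic.h>.  The struct and
function names are hypothetical.  On an LL/SC machine the compiler
would typically turn the compare-exchange into an LL/SC retry loop
much like the hand-written one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical singly linked list node. */
struct node {
    int value;
    struct node *next;
};

/* Push `n` onto the front of the list whose head pointer is *head. */
static void list_push(_Atomic(struct node *) *head, struct node *n)
{
    assert(head != NULL && n != NULL);

    /* Atomic read of the current head pointer. */
    struct node *old = atomic_load(head);

    do {
        /* Build the new first element: its next pointer is the old head. */
        n->next = old;
        /* Replace the head only if it is still `old`.  On failure the
         * call stores the current head into `old`, and we retry. */
    } while (!atomic_compare_exchange_weak(head, &old, n));
}
```

This covers insertion only.  Removing nodes with the same pattern
needs extra care, which is outside the scope of this sketch.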
