
Buffers: underneath the sky-high abstraction of pointers

Yossi Kreinin, October 31, 2014

C is considered a "low-level language" because it gives you "bare pointers". You can go as far as running int* p = (int*)0x1234 and seeing what *p does. A stark contrast to most languages, where you can access person.name, but (&person+15)->name is off limits.

In my chip-related work, however, I sometimes wear a hat where our job is done when *(int*)0x1234 works as intended. From that angle, pointers aren't a low-level feature but a sky-high abstraction. Making them work takes the combined efforts of many people in several companies spread all over the globe - not unlike the journey of a web page from a server to a screen.

At a high level, to move data from point A to point B, "you just move it through a bunch of buffers". This basic realization causes me to instinctively consider memory a boring subject - "buffers, buffers, buffers..." However, over the years I learned to appreciate that it's not "just buffers" any more than programming is "just typing". It gets pretty clever.

So we'll follow the data's journey through those buffers, stopping at some of the interesting ones. The web page's journey is the much more widely celebrated one, though - everyone knows about TCP, but what's OCP, or AXI, for that matter? This is my attempt at rectifying the injustice. Most of my background here is in embedded systems, so I'll describe things happening there, perhaps at the expense of stuff happening elsewhere (say, PCI devices). Nor will I dwell on physical/EE issues such as the structure of DRAM cells - you can read Ulrich Drepper's "What Every Programmer Should Know About Memory" if you want to learn about this and other things that programmers have no reason to know about memory.

I'm oversimplifying, and I'm not aiming at completeness, at "A General Theory of Memory" of any kind. I just want to talk about a bunch of interesting things/angles that I haven't seen discussed much. (Those which are commonly discussed elsewhere, such as virtual memory and cache coherence, will be treated more briefly than others.) If you come away thinking, "here's a load of hairy stuff I didn't know my computer had", I will have achieved my purpose.

CPU's store buffer

Real comp arch geeks will know this one better than me; perhaps they'll find surprises in the rest.

So: when executing *pointer=value, the CPU will typically try to store the value in the cache. For various reasons, it's not an instantaneous operation.

One such reason is, if the pointer is not cached right now - if the store is a cache miss - then the CPU needs to read old data in order to write the new data. Why? Because a cache line is 32 or 64 bytes wide, and there's just one "valid" bit saying, "all these bytes are valid". So you can't allocate a new cache line, fill in just one 4- or 8-byte word in the much larger cache line and be done. You need to read the rest of the bytes mapping to that cache line - all old data - into the cache. Otherwise, when the cache line is written back to memory, you'd overwrite all that old data with garbage.
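
To make the "one valid bit for all these bytes" point concrete, here's a minimal sketch of the bookkeeping a cache line might carry - the 64-byte size and the field names are assumptions for illustration, not any particular CPU's tag layout:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_BYTES 64  /* assumed line size for this sketch */

    /* One valid bit covers all LINE_BYTES data bytes, so a 4- or 8-byte
       store can't make the line valid by itself - the remaining bytes
       must first be read from memory (the "write miss acts like a read
       miss" situation described above). */
    typedef struct {
        uint64_t tag;               /* which addresses this line holds */
        bool     valid;             /* all LINE_BYTES bytes valid, or none */
        bool     dirty;             /* modified since it was fetched */
        uint8_t  data[LINE_BYTES];  /* the cached bytes themselves */
    } cache_line;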

So a write miss is like a read miss, and read misses can take many cycles. Generally, the CPU tries to do something useful while waiting for such long operations. With stores, it's usually easy to find useful work to do even before they're done, because who cares when *pointer=value gets done? We'll only care when we read from pointer next time - more often than not, thousands or millions of cycles later.

So stores typically land in a store buffer, and the CPU marches along - as far as it's concerned, it's done once it has computed pointer and value; from then on, it's the memory system's job to remember that *pointer == value. The store buffer is a queue drained by the memory system as it handles the stores, and filled at the other end by the CPU as it executes more stores.
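
As a mental model - an illustration in C with made-up names and sizes, not how any particular CPU is wired - you can picture the store buffer as a small FIFO of pending (address, value) pairs, filled at one end by the CPU and drained at the other by the memory system; the speculative flag is used in the next section's sketch:

    #include <stdint.h>

    #define SB_ENTRIES 8  /* real store buffers are similarly small */

    typedef struct { uint64_t addr, value; int speculative; } sb_entry;

    typedef struct {
        sb_entry entry[SB_ENTRIES];
        int head, tail, count;  /* drained at head, filled at tail */
    } store_buffer;

    /* CPU side: executing *pointer = value just means appending an entry. */
    static int sb_push(store_buffer *sb, uint64_t addr, uint64_t value, int spec) {
        if (sb->count == SB_ENTRIES) return 0;  /* full - the CPU has to stall */
        sb->entry[sb->tail] = (sb_entry){ addr, value, spec };
        sb->tail = (sb->tail + 1) % SB_ENTRIES;
        sb->count++;
        return 1;
    }

    /* Memory-system side: drain the oldest store, in order, but only
       once it's no longer marked speculative. */
    static int sb_drain_one(store_buffer *sb, void (*commit)(uint64_t addr, uint64_t value)) {
        if (sb->count == 0 || sb->entry[sb->head].speculative) return 0;
        commit(sb->entry[sb->head].addr, sb->entry[sb->head].value);
        sb->head = (sb->head + 1) % SB_ENTRIES;
        sb->count--;
        return 1;
    }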

Speculation

OK, so we have a queue. Doesn't look like the most exciting creature. One interesting thing about it, however, is its side-effect of enabling speculation:

  • Say we hit a branch depending on variables x and y that we haven't yet computed. We don't want to wait until we do. We want to keep issuing instructions to keep our pipeline busy. The branch predictor speculates that the branch will be taken. So we take it and happily start running instructions - executing them speculatively. Meanwhile, x and y are computed - and oops, we find out that the branch should not have been taken. We now need to undo our work. How do we undo our memory operations? Well, we just throw them out of the store buffer! (Of course we must have previously marked the stores as "speculative", so the memory system won't execute them until we're certain about x and y. The point is, a buffer for memory side-effects is the only practical way to support speculative execution - undoing actual changes to memory is hard and slow.) A sketch of this squashing, in terms of the toy store buffer above, follows this list.
  • One form of speculative execution is transactional memory. Here, we speculate that we can run all this here code without anyone touching the stuff we're changing before we're done. If this turns out to be false, again we can undo the writes in the store buffer.
  • This method, together with a method to undo changes to registers, can be used to implement other kinds of speculative execution, such as speculative multithreading (where you try to run several loop iterations in parallel, speculating that they're independent; this, however, gets pretty gnarly and I'm not sure which commercial CPU ever did it.)
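
Continuing the toy store_buffer sketch above (an illustration, not a real microarchitecture): the speculative stores are the youngest entries in the queue, so "undoing our work" is just popping them off the tail, and confirming the speculation is clearing a flag so the drain logic may write them back.

    /* Misprediction discovered: drop the speculative (youngest) entries. */
    static void sb_squash_speculative(store_buffer *sb) {
        while (sb->count > 0) {
            int last = (sb->tail + SB_ENTRIES - 1) % SB_ENTRIES;
            if (!sb->entry[last].speculative) break;
            sb->tail = last;  /* this store never reaches memory */
            sb->count--;
        }
    }

    /* Speculation confirmed: allow the drain logic to commit the stores. */
    static void sb_commit_speculative(store_buffer *sb) {
        for (int k = 0, i = sb->head; k < sb->count; k++, i = (i + 1) % SB_ENTRIES)
            sb->entry[i].speculative = 0;
    }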

Fences

Is there a catch? Well, yeah.

The biggest trouble is, if you tell another CPU, "hey, I computed something in memory - you can now do stuff with it!", but it's still in the store buffer, the other CPU won't see it.

Hence instructions like x86's MFENCE which, once executed, guarantee that "stores are globally visible" - which, in an oversimplified form, means that the store buffer is drained.

Without fences, not only may other processors see your changes late, but they might see them out of order - you change x and then y, but they see y changing first. Why? Say, because y was cached but x wasn't. The store buffer is not necessarily drained in-order - even in a "simple", "in-order" core. This goes to show that "out of order execution" is not a binary thing but a "slippery slope", and that the memory system is often the first thing to become "somewhat out-of-order" - a recurring point in this write-up.
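
Here's the classic data-plus-ready-flag pattern, sketched with C11 atomics; the variable names are mine, and I'm using C11's portable release/acquire fences rather than a raw MFENCE. On x86 these particular fences may compile to nothing, because the hardware already keeps stores ordered strongly enough; on weaker memory models they become real fence instructions:

    #include <stdatomic.h>

    int payload;            /* ordinary data, written before the flag */
    atomic_int ready = 0;

    void producer(void) {
        payload = 42;
        /* Without the fence, another core may observe ready == 1 while
           the store to payload still sits in this core's store buffer. */
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }

    int consumer(void) {
        while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
            ;               /* spin until the flag becomes visible */
        atomic_thread_fence(memory_order_acquire);
        return payload;     /* guaranteed to see 42, not stale data */
    }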

And of course, compilers don't emit fences all over the place - that'd negate all the benefits of store buffers. Only things like threading primitives use fences.

Now, suppose you want to write clever code, where variables are accessed without locking, and without waiting for any sort of civilized, OS- or library-based message or signal or atomic increment telling that they're ready. There are three modes of operation:

  1. Go ahead and do it, wait until it breaks, "fix" it, repeat
  2. Read everything about this stuff, from the complete works of Leslie Lamport to obscure Dr. Dobb's articles about wrapping it all in multiparadigm virtual C++ templates, and then do The Right Thing (tm)
  3. Avoid this shit like the plague, burying synchronization in libraries, and use dynamic instrumentation and/or static analysis to make sure that NOBODY EVER TOUCHES ANYTHING without using the synchronization you standardized on.

My work on checkedthreads is my attempt to popularize option 3; there are other options in that spirit. Option 1 is the road to madness. Option 2 works in theory; in practice, it becomes a variant of option 1.

A rug too small

So, the first problem with store buffers is the bugs. Other problems arise when this rug is not large enough to sweep the problem under:

  • Atomic increments, compare and swap, etc. are slow in part because you need a fence there. You get to observe all the latencies that the store buffer normally helps hide.
  • Even without fences, at some point you may run out of store buffer space. Store buffers can't be too big because loads on the same CPU need to peek inside. If you do data=*pointer after storing to pointer, you expect data to be the value you just stored. If the store is buffered, the value needs to be fetched from the buffer. If the cache memory is called "level 1", the store buffer is a sort of "level 0" data cache. (There may be more "level 0 caching" - for instance, a small "victim cache" for just-evicted data.) "Level 0 caches" ought to be comparatively tiny - comparing an address to a lot of cached addresses in parallel costs a lot of energy. A sketch of this lookup follows the list.
  • In particular, it's one reason why I'm personally not keen on hardware transactional memory: for large-enough transactions, you'll fall back on software transactional memory, because the store buffer is guaranteed to overflow. (Unlike normal stores, stores participating in a transaction cannot be drained from the buffer until the transaction is over.) But there could be good counter-arguments against this - transactions are for event-handling systems, not for signal processing or scientific computing systems, so it's not my specialty.
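
Here's the lookup mentioned above, continuing the toy store buffer model (illustration only): a load checks the buffer before the cache, newest matching entry first. Real hardware compares the address against every entry in parallel, which is exactly why the buffer has to stay tiny.

    /* "Level 0" lookup: did we recently store to this address? */
    static int sb_forward(const store_buffer *sb, uint64_t addr, uint64_t *value) {
        for (int k = sb->count - 1; k >= 0; k--) {  /* newest match wins */
            int i = (sb->head + k) % SB_ENTRIES;
            if (sb->entry[i].addr == addr) {
                *value = sb->entry[i].value;
                return 1;                           /* forwarded from the buffer */
            }
        }
        return 0;                                   /* not here - go ask the cache */
    }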

That's about it; quite a lot of ripples from one buffer in the memory system.

!@#$ DRAM and its %^&* buffers

DRAM is one big lie. You can't actually access DRAM cells at the advertised frequency. How do they fake what they don't have? Buffers.

DRAM has huge (2 or 4K) buffers inside it that are accessible at a relatively high frequency, called "active rows". When you access DRAM, it first spends many cycles loading data into that row; it's not that slow because the bandwidth between the row and the DRAM cells is plentiful.

(Actually, it's the DRAM controller on the chip using DRAM that handles this process. DRAM chips can't do squat without a sophisticated external controller - a peculiar design motivated by reasons best explained by EEs. We'll get back to DRAM controllers soon.)

What happens if you access data outside the row? Well, DRAMs have several active rows - like 4 or 8. Each active row keeps 2 or 4K of data from a corresponding bank - a part of the DRAM keeping all the data at a certain range of addresses.

How do addresses map to banks? Typically, if you iterate over data sequentially, then every next row will be in a different bank. There's no shortage of crazier mappings, and there exist people recommending them, but I'm not one of them. Of course the mapping, if configurable, is configured at boot time and never changed - if you change it, the data already in the DRAM turns into a pumpkin.
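
For illustration, here's one simple address-to-DRAM mapping: column bits lowest, bank bits next, row bits on top - which gives exactly the "next row lands in the next bank" behavior. The 2KB rows and 8 banks are assumptions; real controllers let you configure the split (at boot time, as noted):

    #include <stdint.h>

    #define ROW_BYTES 2048u  /* assumed row (active row buffer) size */
    #define NUM_BANKS 8u     /* assumed bank count */

    typedef struct { uint32_t column, bank, row; } dram_addr;

    /* Sequential addresses walk through a row's columns, then move to
       the same-numbered row in the next bank, and only after visiting
       all the banks do they move on to the next row. */
    static dram_addr decode(uint64_t phys) {
        dram_addr a;
        a.column = (uint32_t)(phys % ROW_BYTES);
        a.bank   = (uint32_t)((phys / ROW_BYTES) % NUM_BANKS);
        a.row    = (uint32_t)(phys / (ROW_BYTES * NUM_BANKS));
        return a;
    }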

The switch-bank-every-row mapping is nice because if you iterate over data sequentially, you won't have to wait for the active row to be written back to the DRAM before accessing the next row. That's right, those rows are always written back to the DRAM ("precharged"), even if you didn't write anything - because the DRAM cell charge is drained (so moved, not copied) to the active row. When you're done with the row, you need to move it back, or you'd lose the data.

The end result is a retarded 1-way set-associative cache made of row buffers - with 4-8 huge, 2-4K lines and no dirty bits. What if you access more than 4-8 data streams at a time - not that unlikely for a multi-core chip? Tough luck: you'll spend much of the time missing this cache.
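
Treating the row buffers as that cache takes one comparison per access. The toy counter below reuses decode() and the constants from the previous sketch (compile the two together); with this made-up access pattern, 8 interleaved sequential streams are almost all row hits, while 16 streams fighting over 8 banks are almost all misses:

    #include <stdio.h>

    static uint32_t open_row[NUM_BANKS];  /* which row is active per bank */
    static int      row_open[NUM_BANKS];  /* is any row active in this bank? */
    static long     row_hits, row_misses;

    static void access_dram(uint64_t phys) {
        dram_addr a = decode(phys);
        if (row_open[a.bank] && open_row[a.bank] == a.row) {
            row_hits++;       /* data already sits in the active row */
        } else {
            row_misses++;     /* precharge the old row, activate the new one */
            open_row[a.bank] = a.row;
            row_open[a.bank] = 1;
        }
    }

    int main(void) {
        const int S = 16;  /* try 8: the miss count collapses */
        for (long i = 0; i < 1000000; i++) {
            long s = i % S, k = i / S;  /* S round-robin sequential streams */
            access_dram((uint64_t)s * ROW_BYTES + (uint64_t)k * 64);
        }
        printf("row hits = %ld, row misses = %ld\n", row_hits, row_misses);
        return 0;
    }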

Even when you access data in an active row, you've been had, kinda. Whether you need it or not, you access the data in full bursts - commonly 16, 32 or 64 bytes.

This is one reason cache lines grow larger over time. If your cache has 32-byte lines and your DRAM has 64-byte bursts, what are you gonna do, read 64 bytes and throw 32 away? Now your L2 cache has 64-byte lines under the assumption that, sure, maybe you won't use them all, but since you have to read them anyway, it's better to bet that you'll need them and keep them.

And now it might make sense to have 64-byte lines in L1 as well: you've read the stuff, you sorta bet that you'll need it, and if you're right, sending it to L1 right now will save you an L1 miss. That may well be worth the coarser granularity of data management in L1 (that is, evicting too much when reading new data.)

(Note that large cache lines and short DRAM bursts are not that bad, but lines shorter than bursts are horrible. Hence it's enough for some kinds of DRAM to have large bursts to force larger lines on everyone.)

DRAM controller: a multi-core chip's life-saving buffer

So all your data is kept in the ghastly DRAM - increasingly more a hard disk than a truly random access memory. (Hard drives, on the other hand, increasingly become more like RAM with the advent of SSDs. Go figure.)

Now you need a DRAM controller. At a minimum, this thing issues read and write commands according to the DRAM protocol - and then keeps track of the active rows, writes them back when necessary, and loads others instead. Basically, it's the DRAM chip's cache controller.

What else can the DRAM controller do? Reorder memory requests. Why? To minimize the cost of DRAM's inherent problems, such as:

  • Bank conflicts: 2 requests want to access the active row. 2 other requests want to access another row in the same bank - and they came first. Serving them first means unnecessary cache misses. Better reorder them and serve the row hits first. (A CPU might do the same to requests in its store buffer - except its cache has multiple ways, so conflicts aren't the disaster they are in DRAM.)
  • Read to write penalty: Believe it or not, this thing will make you wait when you switch from reading to writing. (There are good reasons for it - I just take it a bit too personally is all.) Better reorder the requests so that all reads happen together and then the writes.
  • Write to read penalty: you see where this is going. (A toy request-reordering sketch follows this list.)
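
A toy version of such reordering - made-up types, no attempt at fairness or at bounding how long a request may be delayed - is just a sort of the pending queue, keeping requests to the same bank and row together and batching reads before writes:

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint64_t addr;
        uint32_t bank, row;  /* decoded from addr as in the mapping sketch above */
        int      is_write;
        int      arrival;    /* original order, used as a tie-breaker */
    } mem_request;

    /* Group row hits back to back, and serve reads before writes so the
       read/write turnaround penalty is paid per batch, not per request. */
    static int cmp_requests(const void *pa, const void *pb) {
        const mem_request *a = pa, *b = pb;
        if (a->bank != b->bank)         return (a->bank < b->bank) ? -1 : 1;
        if (a->row  != b->row)          return (a->row  < b->row)  ? -1 : 1;
        if (a->is_write != b->is_write) return a->is_write - b->is_write;  /* reads first */
        return a->arrival - b->arrival;
    }

    static void reorder(mem_request *queue, size_t pending) {
        qsort(queue, pending, sizeof *queue, cmp_requests);
    }

A real controller does something like this in hardware over a window of pending requests, and also has to guarantee that no request is starved forever.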


Everyone needs an L2

DRAM being what it is, GPUs were forced to grow L2 caches over time. Note that GPUs usually don't have L1 caches - just registers and explicitly addressed local memories.

In fact, in the original GPU papers about "streaming devices", a point was made that you don't need caches if you process "data streams" - read an image or a bunch of vertices or whatever, do something to that, write back the result.

Well, sure - as long as you manage to make use of the full DRAM burst. For instance, in CUDA, you can access p[i] in your code and i is completely arbitrary - but the advice is, all threads of a warp (a group of 32 threads) should have i's in a contiguous range, such as 64, 65, 66, ...

Why so? Because then the GPU cleverly composes (coalesces) the threads' operations into 1 DRAM burst. Good stuff - if the programmer did the right thing, which does not always occur naturally. If it doesn't, tremendous amounts of DRAM bandwidth are wasted, with a single word used per burst issued.
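
A back-of-the-envelope way to see the cost is to count how many bursts one warp's 32 accesses actually touch. This is plain C modeling the arithmetic, not CUDA; the 64-byte burst size and the two index patterns are assumptions:

    #include <stdint.h>
    #include <stdio.h>

    #define WARP_SIZE  32
    #define BURST_SIZE 64u  /* assumed DRAM burst, in bytes */

    /* Count the distinct BURST_SIZE-aligned chunks touched when each of
       the 32 threads reads the 4-byte element p[idx[t]] - i.e. how many
       bursts the memory system must issue for this one warp-wide load. */
    static int bursts_touched(const uint32_t idx[WARP_SIZE]) {
        uint64_t seen[WARP_SIZE];
        int n = 0;
        for (int t = 0; t < WARP_SIZE; t++) {
            uint64_t burst = (uint64_t)idx[t] * 4 / BURST_SIZE;
            int dup = 0;
            for (int k = 0; k < n; k++)
                if (seen[k] == burst) { dup = 1; break; }
            if (!dup) seen[n++] = burst;
        }
        return n;
    }

    int main(void) {
        uint32_t contiguous[WARP_SIZE], scattered[WARP_SIZE];
        for (int t = 0; t < WARP_SIZE; t++) {
            contiguous[t] = 64 + t;    /* i = 64, 65, 66, ... as in the advice */
            scattered[t]  = t * 1024;  /* each thread off in its own region */
        }
        printf("contiguous: %d bursts, scattered: %d bursts\n",
               bursts_touched(contiguous), bursts_touched(scattered));
        return 0;  /* prints 2 and 32: 16x the DRAM traffic for the same data */
    }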

Worse, this programming model is not future-proof: reading 32 bytes from 32 threads still wastes half the bandwidth on DRAM with 64-byte bursts. (DRAM burst size tends to grow over time, because so does the difference between the interface frequency and the internal frequency.)

Not to worry though - newer GPUs have L2 caches, which act, even if data is never used twice, as a buffer between the DRAM and the GPU registers. You can read your data in any order, and as long as it fits in the cache, you'll never need to read from the same DRAM address twice.

So caches are not just about data reuse and faster access; they're also a way to mitigate DRAM's increasingly lackluster support for truly random access.

Of course you also need L2s for other reasons, such as different stream processors (what GPU vendors should have called "cores" but don't, for marketing reasons) reading the same data. Or processing the same megabyte of data 100 times, and not wanting to waste the scarce DRAM bandwidth for reading it more than once.

The upshot is, all those accelerators in today's chips - GPUs, DSPs and the rest of the flying circus - need L2 caches, even if the processor in question is supposedly well-adapted to cacheless life, and feels just fine without an L1 cache. The fact that different accelerators originate in different companies slows things down to some extent, but we'll get there.

Banks

Wait a minute. If I need to read the same megabyte of data 100 times, what's the difference between reading it from the cache and the DRAM? We'd expect DRAM's latency to be higher - but a GPU tolerates high latency, and a CPU might use memory prefetching to mitigate the latency. DRAM's bandwidth, however, is fine, isn't it?

Well, DRAM's bandwidth is better than its horrible latency, but an on-chip cache can have still more bandwidth. First of all, DRAM is outside the chip, and signals traveling outside the chip are costly. This limits DRAM bandwidth - something that may change for the better, to some extent, with different physical designs such as 3D stacking.

Then there's the "quality" of the bandwidth. DRAM gives you high bandwidth for not-so-random access; caches can support high bandwidth for pretty much random access.

This is particularly important in multi-core scenarios:

  • One core might have enough DRAM bandwidth for its needs, but many cores will fight each other unless a higher-bandwidth cache provides the data without needing the DRAM.
  • Each of 4 cores accessing data sequentially looks like pretty random access from outside, because they're not in sync, right?

This latter point is worth elaborating on.



Bus protocols

What will happen when a simple CPU - no virtual memory, no operating system - reads from the address 0x1234?

Actually, anything could happen, depending on what's mapped at that address:

  • If it's memory, then the bits stored at that address will be sent to the CPU as the data.
  • If it's a timer, then the current cycle number might be the response. Read again and you'll get different data.
  • If it's a piece of hardware resetting the chip, then 0 might be sent as a response, but the CPU probably won't have the time to receive it, because the chip will be reset. (It's a bit barbaric to change things upon a read - generally reads return some status, and it's writes that change things. But this is not a rule but a convention, and there are exceptions - say, reading from a hardware FIFO would dequeue the read item from the FIFO.)

So there could be anything at 0x1234 - memory or nuclear warheads or nasal demons. For the CPU, accessing an address is like calling an object method, data=bus.read(address) or bus.write(data,address), and then anything might happen.
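
The "calling an object method" analogy can be written down literally. This is just the analogy rendered in C - not any real bus interface - with a made-up timer device that returns a different value on every read:

    #include <stdint.h>
    #include <stdio.h>

    /* The CPU's view of the outside world: a read method and a write
       method. What actually answers depends on what's mapped at the address. */
    typedef struct {
        uint32_t (*read)(uint64_t addr);
        void     (*write)(uint64_t addr, uint32_t data);
    } bus;

    static uint32_t cycle_counter;

    /* A "timer" mapped at 0x1234: every read returns a different value. */
    static uint32_t timer_read(uint64_t addr) {
        (void)addr;
        return cycle_counter += 100;  /* pretend some cycles have passed */
    }

    static void timer_write(uint64_t addr, uint32_t data) {
        (void)addr; (void)data;       /* this device just ignores writes */
    }

    int main(void) {
        bus b = { timer_read, timer_write };
        printf("%u\n", b.read(0x1234));  /* 100 */
        printf("%u\n", b.read(0x1234));  /* 200 - same address, different data */
        return 0;
    }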

So does the CPU have no idea what's what, and just get an arbitrary result depending on the bus and the chip?

Yes, and no:

  • Indeed the CPU is (most often) designed separately from the rest of the chip, and simply has no way of knowing what's outside.
  • However, most CPUs have data caches. Caches are great for memory. But caches don't work with a timer at 0x1234 - if you cache the value you read from 0x1234, you'll keep getting that time instead of the up-to-date time you'd get if you actually accessed the bus. With a nuclear warhead, if you write the launch commands to the cache, nothing will be launched, except a random subset of commands which will be written to the nuke controller when the cache line is evicted.

***

Interconnects - implementation issues, patent wars

Cachability.

Non memory devices

Response buffering

Banks

Out of order responses

Cache coherence and the scalability of directory-based systems

Virtual memory and TLB misses

CPU - outstanding loads, store buffers and transactional memory/speculative execution

Write combining? ECC?


