EmbeddedRelated.com

Accelerators: computer architecture in the post-CPU era

Yossi Kreinin, October 24, 2014

In what follows, I make the case that:

  • accelerators are where computer architecture is developing the most right now
  • "the twilight of Moore's law" will strengthen this trend
  • accelerators are very different, but nonetheless have some common traits and issues
  • the technology, economics and business aspects of accelerators are very different from CPUs'

My work on accelerators is the centerpiece of my career in computing to date. This puts me in a strange position of having an experience that is uniquely valuable - but to a very small group of people, whom my writing might easily fail to reach.

Moreover, I'm not sure where my knowledge falls on the range between a mathematical formula which you learn through reading, and a person's face which you must simply see to know it - and my having seen it before you and writing about it is of little use.

Still, if I learned something, better to write it down than helplessly watch it fade from memory as years go by. If it fails to be useful, I hope it won't fail to entertain (though the subject is dry enough to keep my hopes low.)

***

What's "computer architecture"? For starters, what's a "computer"? What goes into one? What will your effort be primarily expended on when designing one?

Hennessy & Patterson's famous book "Computer Architecture: A Quantitative Approach" implicitly gives an answer to this - it is, first and foremost, a book about CPU design. "Computer" = "CPU + some other stuff".

Chipworks' silicon die shots give a different answer, however. Have a look at Apple's application processors and you'll see how small the CPUs are relative to the rest - even without subtracting all the RAMs, especially the large level 2 caches.

What's all that other stuff? Some of it is peripheral controllers. Most of it is accelerators:

  • DSPs for voice and wireless communication
  • most famously, a GPU for graphics - and some general-purpose computations ("GPGPU")
  • an ISP ("image signal processor") - typically a big hunk of non-programmable logic taking raw data from the image sensor and doing stuff ranging from color enhancement to face detection

New accelerators are born every year - for instance, a couple of years ago, DSP companies such as CEVA, Tensilica and TI finally introduced "vision accelerators" - DSPs specifically targeted at image processing and computer vision.

(Why "finally"? I've been working on vision accelerators for more than a decade. When I started, DSP companies claimed that their existing DSPs already excelled at this stuff - or even were explicitly designed for this stuff - and that you didn't need anything more special-purpose. That these companies went on to develop "vision accelerators" is the best refutation of their older claims. Made me feel rather smug, I must admit.)

Meanwhile, CPU design has largely peaked - what's next after super-scalar, out-of-order, hyper-threaded CPUs with SIMD extensions, branch prediction, speculative execution and data prefetching? Moreover, you can license pretty much any kind of CPU design from multiple companies. With the average reported license price of a few cents per chip, what will compel you to design a CPU of your own?

(Apple and Qualcomm reportedly do make their own CPUs - AFAIK paying tens of millions to ARM for the right to not use their designs, but instead reimplement a compatible CPU themselves. Whatever their reasons, most chip companies either lack such reasons, or have pockets too shallow to pay for differentiation in the CPU department.)

Overall, my bet is that you're much more likely to be working on accelerator design than CPU design - either at an accelerator IP company like CEVA or Imagination, or at a chip company like TI, NVIDIA or Qualcomm. And while a better CPU design might net you a 1.2x performance gain over a competitor with a similar power budget, a better accelerator design can give you a 10x gain - albeit on a narrower set of workloads.

So why are accelerators, which use up a growing share of both silicon area and the efforts of chip designers, so underrepresented in the academic subject of "computer architecture"?

In Hennessy and Patterson's book, GPUs appeared for the first time in the 5th edition from 2011, in the chapter on data parallelism. And while there's an introduction to NVIDIA's GPUs and the CUDA programming model, it's nothing close to the detailed consideration of design alternatives found in the parts about CPUs.

We'll now discuss the technology, economics and business of accelerators. This discussion will hopefully shed some light on the growing gulf between the theory and practice of computer design, and help form an intuition of what these "accelerator" thingies are, as a class of computing machines.

(Economics, and business? Isn't this redundant? Not to me - I think of economics and business as two different angles entirely. Economics is about raising your standard of living. Businesses aim at getting your money. For instance, an economist might blame a monopoly for failing to produce more output at a lower price, as competing companies would. This results in an underproduction of goods the cost of which could be covered at the existing demand level. But this won't prevent a businessman like Peter Thiel from praising monopolies as the best kind of business for the founding businessmen.)

The technology of accelerators: The Undead Supercomputer Society

First, we'll see how the limitations of chip fabrication are making programmable accelerators a necessity, because we run out of other, more attractive options. Then, we'll discuss some ways in which programmable accelerators address these limitations.

The road to programmable accelerators

Originally, Moore's Law plus "Dennardian scaling" were giving you 2x the transistors switching 1.44x faster every 2 years or so. In the 21st century, however, the following factors have stopped improving, in that order:

  1. Peak achievable frequency
  2. Energy efficiency
  3. Transistor price
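Before those factors gave out, the compounding was formidable. Here's a back-of-the-envelope sketch of the 2x-transistors, 1.44x-frequency cadence quoted above (the 2x and 1.44x figures are from the text; the "raw throughput" product is just illustrative arithmetic, not a real performance model):

```python
# Back-of-the-envelope: Dennardian scaling per ~2-year generation.
# Transistor count doubles and frequency grows ~1.44x, while power per
# chip stays roughly constant (transistors shrink and run at lower voltage).

transistors = 1.0
frequency = 1.0
for generation in range(1, 4):
    transistors *= 2.0
    frequency *= 1.44
    # "Raw throughput" here is just transistors * frequency.
    print(f"gen {generation}: {transistors:.0f}x transistors, "
          f"{frequency:.2f}x frequency, "
          f"{transistors * frequency:.2f}x raw switching capacity")
```

Three generations in, you're looking at roughly 24x the raw switching capacity for about the same power - which is what made "the killer microprocessor" so hard to compete with.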

What was the impact on computer design?

The conventional narrative is, the end of frequency scaling has brought us multi-core CPUs. However, the original Moore's law already made multi-core look worthwhile. You get 2x transistors, each 1.44x faster than the last ones. Why not use them for 2x the CPU cores, each 1.44x faster than the previous cores?

The answer is that a miner won't mine silver until he runs out of gold. If you could use 2x the transistors to create one CPU running say 30% faster than a competitor's 2 CPUs, consumer devices mostly running serial code would prefer your design to the competitor's.

At some point, however, you ran out of ways to increase serial performance with more transistors. After all, serial code is ultimately constrained by dependencies between its operations - there's a minimal number of steps you must take to execute it. Moore's law, however, kept throwing more and more transistors at you - even after you got close to that minimal number of steps, by running everything you could possibly run in parallel, out-of-order, etc.
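The dependency constraint is easy to see in code. A toy illustration (both loops are hypothetical stand-ins, not anyone's real workload): the first loop's iterations are all independent, so more cores could in principle split the work; the second forms a chain where step i needs step i-1's result, so no amount of hardware parallelism shortens its critical path:

```python
# Two loops; only the first benefits from more parallel hardware.

data = list(range(1000))

# Data-parallel: every iteration is independent - 2x the cores could,
# in principle, finish this in half the time.
squares = [x * x for x in data]

# Serial dependency chain: each step consumes the previous step's
# result. Iterating a nonlinear recurrence like this offers no way to
# skip ahead, so the critical path is as long as the loop itself -
# extra transistors can shave time off each step, but they cannot
# start step i before step i-1 is done.
x = 12345
for _ in range(1000):
    x = (x * x + 1) % (2 ** 31)
```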

Which is why in the embedded space, we were using multi-core CPUs in the early 2000s, and we got "civilized", cache-coherent multi-core as early as 2005 - even though the frequencies of embedded CPUs kept improving (simply because they lagged behind the frequencies of actively cooled, power-hungry desktop CPUs.) You simply had no better use for the transistors than adding cores.

So while the end of frequency scaling forced people to pay attention more quickly - "hey, the only way to get any performance gain is to use more cores!" - and prompted articles like "The Free Lunch Is Over", Moore's Law plus Dennard's scaling have been pointing there all along: a growing number of cores made from the growing number of transistors. These cores might be "brawny" or "wimpy" - but there will be many.

This is why a CS professor I studied under in the late 90s, a few years before the frequency scaling was over, talked about the demise of exotic supercomputer architectures at the hands of "the killer microprocessor" - but foresaw huge growth in multi-core, multi-box computing. The CPU had the nicest programming model, and a huge market driving continuous improvements, giving you more CPUs per dollar every 2 years. You couldn't compete with that!

That was long before GPU-based supercomputers entered the Top 500 supercomputer list. What happened in the meanwhile?

Then came "post-Dennardian scaling": you could have your 2x transistors, but they'd use 2x more energy if running at their 1.44x faster frequency. Ouch! Now you can't be adding more cores of the same kind. If your energy budget is fixed (you're running off a battery), or if your power budget is fixed (you can't improve heat dissipation - and when can you, really?..), you have no use whatsoever for the extra transistors.

Thus appeared the programmable accelerator: a way to spend the otherwise-unusable extra transistors on specialized hardware that delivers more performance per watt than another identical core - albeit on a narrower set of workloads.

===

start with "quantitative approach" book (says nothing about accelerators AFAIK - CHECK)

Technology: the undead supercomputer society

blows: frequency, power, price.

multicore, ugly chip (green droid/old gpus), programmable accelerators. NOW you need programmers!

the professor and "the killer microprocessor"

mention Fisher's arch book (mention that it's "too pretty")

programmable accelerators are hell for the same reason that OpenGL is portable but CUDA is not. So the idea is, this stuff will be popular when there's no other way to deliver serious improvements.

Convenience: CPUs, APIs, programmable accelerators. But - efficiency in the current state of things is the opposite.

Economics: performance precludes portability

no quantitative approach!

Business: the rise of undocumented hardware

lock-in: theoretically should go up relative to CPUs (can't port to another accelerator) but in reality goes down (less code to port; easier to build a performant alternative given a simpler architecture and the absence of portability constraints.)

OpenCL goes here as well?



