64-bit embedded computing is here and now

Started by James Brakefield June 7, 2021
On 6/8/2021 23:18, David Brown wrote:
> On 08/06/2021 16:46, Theo wrote: >> ...... > >> Memory bus/cache width > > No, that is not a common way to measure cpu "width", for many reasons. > A chip is likely to have many buses outside the cpu core itself (and the > cache(s) may or may not be considered part of the core). It's common to > have 64-bit wide buses on 32-bit processors, it's also common to have > 16-bit external databuses on a microcontroller. And the cache might be > 128 bits wide.
I agree with your points and those of Theo, but isn't the cache basically as wide as the registers? Logically, that is; a cache *line* is several times that, which is probably what you are referring to. Not that it changes the fact that 64-bit data buses/registers in an MCU (apart from FPU registers; 32-bit FPUs are useless to me) are unlikely to attract much interest; there is nothing of significance to be gained, as you said.

64-bit CPUs are of interest to me, of course, and thankfully there are some available, but that goes somewhat past what we call "embedded".

Not long ago, in a chat with a guy who knew some ARM 64-bit, I gathered there is some real mess with their out-of-order execution: one needs to issue a "sync" (a memory barrier, whatever they call it) all the time, and there is a huge performance cost because of that. Has anybody heard anything about it? (I only know what I was told.)

Dimiter
On Tuesday, June 8, 2021 at 3:11:24 PM UTC-5, David Brown wrote:
> On 08/06/2021 21:38, James Brakefield wrote: > > Could you explain your background here, and what you are trying to get > at? That would make it easier to give you better answers. > > The only thing that will take more than 4GB is video or a day's worth of photos. > No, video is not the only thing that takes 4GB or more. But it is, > perhaps, one of the more common cases. Most embedded systems don't need > anything remotely like that much memory - to the nearest percent, 100% > of embedded devices don't even need close to 4MB of memory (ram and > flash put together). > > So there is likely to be some embedded aps that need a > 32-bit address space. > Some, yes. Many, no. > > Cost, size or storage capacity are no longer limiting factors. > Cost and size (and power) are /always/ limiting factors in embedded systems. > > > > Am trying to puzzle out what a 64-bit embedded processor should look like. > There are plenty to look at. There are ARMs, PowerPC, MIPS, RISC-V. > And of course there are some x86 processors used in embedded systems. > > At the low end, yeah, a simple RISC processor. > Pretty much all processors except x86 and brain-dead old-fashioned 8-bit > CISC devices are RISC. Not all are simple. > > And support for complex arithmetic > > using 32-bit floats? > A 64-bit processor will certainly support 64-bit doubles as well as > 32-bit floats. Complex arithmetic is rarely needed, except perhaps for > FFT's, but is easily done using real arithmetic. You can happily do > 32-bit complex arithmetic on an 8-bit AVR, albeit taking significant > code space and run time. I believe the latest gcc for the AVR will do > 64-bit doubles as well - using exactly the same C code you would on any > other processor. > > And support for pixel alpha blending using quad 16-bit numbers? > You would use a hardware 2D graphics accelerator for that, not the > processor. > > 32-bit pointers into the software? > > > With 64-bit processors you usually use 64-bit pointers.
|> Could you explain your background here, and what you are trying to get at? Am familiar with embedded systems, image processing and scientific applications. Have used a number of 8, 16, 32 and ~64bit processors. Have also done work in FPGAs. Am semi-retired and when working was always trying to stay ahead of new opportunities and challenges. Some of my questions/comments belong over at comp.arch
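On the complex-arithmetic point David raised: done with real arithmetic it is just four multiplies and two adds per complex multiply. A minimal C99 sketch using 32-bit floats (cplx32 and cmul32 are invented names, purely for illustration; both versions compile to the same few real operations):

#include <complex.h>

typedef struct { float re, im; } cplx32;

static cplx32 cmul32(cplx32 a, cplx32 b)
{
    cplx32 r = { a.re * b.re - a.im * b.im,    /* real part      */
                 a.re * b.im + a.im * b.re };  /* imaginary part */
    return r;
}

/* C99's built-in complex type generates the same real multiplies/adds. */
static float complex cmul32_c99(float complex a, float complex b)
{
    return a * b;
}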
On 6/8/2021 22:38, James Brakefield wrote:
> On Tuesday, June 8, 2021 at 2:39:29 AM UTC-5, Don Y wrote: >> On 6/7/2021 10:59 PM, David Brown wrote: >>> 8-bit microcontrollers are still far more common than 32-bit devices in >>> the embedded world (and 4-bit devices are not gone yet). At the other >>> end, 64-bit devices have been used for a decade or two in some kinds of >>> embedded systems. >> I contend that a good many "32b" implementations are really glorified >> 8/16b applications that exhausted their memory space. I still see lots >> of designs build on a small platform (8/16b) and augment it -- either >> with some "memory enhancement" technology or additional "slave" >> processors to split the binaries. Code increases in complexity but >> there doesn't seem to be a need for the "work-per-unit-time" to. >> >> [This has actually been the case for a long time. The appeal of >> newer CPUs is often in the set of peripherals that accompany the >> processor, not the processor itself.] >>> We'll see 64-bit take a greater proportion of the embedded systems that >>> demand high throughput or processing power (network devices, hard cores >>> in expensive FPGAs, etc.) where the extra cost in dollars, power, >>> complexity, board design are not a problem. They will probably become >>> more common in embedded Linux systems as the core itself is not usually >>> the biggest part of the cost. And such systems are definitely on the >>> increase. >>> >>> But for microcontrollers - which dominate embedded systems - there has >>> been a lot to gain by going from 8-bit and 16-bit to 32-bit for little >> I disagree. The "cost" (barrier) that I see clients facing is the >> added complexity of a 32b platform and how it often implies (or even >> *requires*) a more formal OS underpinning the application. Where you >> could hack together something on bare metal in the 8/16b worlds, >> moving to 32 often requires additional complexity in managing >> mechanisms that aren't usually present in smaller CPUs (caches, >> MMU/MPU, DMA, etc.) Developers (and their organizations) can't just >> play "coder cowboy" and coerce the hardware to behaving as they >> would like. Existing staff (hired with the "bare metal" mindset) >> are often not equipped to move into a more structured environment. >> >> [I can hack together a device to meet some particular purpose >> much easier on "development hardware" than I can on a "PC" -- simply >> because there's too much I have to "work around" on a PC that isn't >> present on development hardware.] >> >> Not every product needs a filesystem, network stack, protected >> execution domains, etc. Those come with additional costs -- often >> in the form of a lack of understanding as to what the ACTUAL >> code in your product is doing at any given time. (this isn't the >> case in the smaller MCU world; it's possible for a developer to >> have written EVERY line of code in a smaller platform) >>> cost. There is almost nothing to gain from a move to 64-bit, but the >>> cost would be a good deal higher. >> Why is the cost "a good deal higher"? Code/data footprints don't >> uniformly "double" in size. The CPU doesn't slow down to handle >> bigger data. >> >> The cost is driven by where the market goes. Note how many 68Ks found >> design-ins vs. the T11, F11, 16032, etc. My first 32b design was >> physically large, consumed a boatload of power and ran at only a modest >> improvement (in terms of system clock) over 8b processors of its day. 
>> Now, I can buy two orders of magnitude more horsepower PLUS a >> bunch of built-in peripherals for two cups of coffee (at QTY 1) >>> So it is not going to happen - at >>> least not more than a very small and very gradual change. >> We got 32b processors NOT because the embedded world cried out for >> them but, rather, because of the influence of the 32b desktop world. >> We've had 32b processors since the early 80's. But, we've only had >> PCs since about the same timeframe! One assumes ubiquity in the >> desktop world would need to happen before any real spillover to embedded. >> (When the "desktop" was an '11 sitting in a back room, it wasn't seen >> as ubiquitous.) >> >> In the future, we'll see the 64b *phone* world drive the evolution >> of embedded designs, similarly. (do you really need 32b/64b to >> make a phone? how much code is actually executing at any given >> time and in how many different containers?) >> >> [The OP suggests MCus with radios -- maybe they'll be cell phone >> radios and *not* wifi/BLE as I assume he's thinking! Why add the >> need for some sort of access point to a product's deployment if >> the product *itself* can make a direct connection??] >> >> My current design can't fill a 32b address space (but, that's because >> I've decomposed apps to the point that they can be relatively small). >> OTOH, designing a system with a 32b limitation seems like an invitation >> to do it over when 64b is "cost effective". The extra "baggage" has >> proven to be relatively insignificant (I have ports of my codebase >> to SPARC as well as Atom running alongside a 32b ARM) >>> The OP sounds more like a salesman than someone who actually works with >>> embedded development in reality. >> Possibly. Or, just someone that wanted to stir up discussion... > > |> I contend that a good many "32b" implementations are really glorified > |> 8/16b applications that exhausted their memory space. > > The only thing that will take more than 4GB is video or a day's worth of photos. > So there is likely to be some embedded aps that need a > 32-bit address space. > Cost, size or storage capacity are no longer limiting factors. > > Am trying to puzzle out what a 64-bit embedded processor should look like. > At the low end, yeah, a simple RISC processor. And support for complex arithmetic > using 32-bit floats? And support for pixel alpha blending using quad 16-bit numbers? > 32-bit pointers into the software? >
The real value in 64-bit integer registers and a 64-bit address space is just that: having an orthogonal, "endless" space (well, I remember some 30 years ago 32 bits seemed sort of "endless" to me...).

Not needing to assign overlapping logical addresses to anything can make a big difference to how the OS is done.

A 32-bit FPU seems useless to me; 64-bit is OK. Although 32-bit FP *numbers* can be quite useful for storing/passing data.

Dimiter

======================================================
Dimiter Popoff, TGI http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/
On 6/8/2021 7:46 AM, Theo wrote:
> David Brown <david.brown@hesbynett.no> wrote: >> But for microcontrollers - which dominate embedded systems - there has >> been a lot to gain by going from 8-bit and 16-bit to 32-bit for little >> cost. There is almost nothing to gain from a move to 64-bit, but the >> cost would be a good deal higher. So it is not going to happen - at >> least not more than a very small and very gradual change. > > I think there will be divergence about what people mean by an N-bit system: > > Register size > Unit of logical/arithmetical processing > Memory address/pointer size > Memory bus/cache width
(General) Register size is the primary driver. A processor can have very different "size" subcomponents.

E.g., a Z80 is an 8b processor -- registers are nominally 8b. However, it supports 16b operations -- on register PAIRs (an implicit acknowledgement that the REGISTER is smaller than the register pair). This is common on many smaller processors. The address space is 16b -- with a separate 16b address space for I/Os. The Z180 extends the PHYSICAL address space to 20b but the logical address space remains unchanged at 16b (if you want to specify a physical address, you must use 20+ bits to represent it -- and invoke a separate mechanism to access it!). And the ALU is *4* bits wide.

Cache? Which one? I or D? L1/2/3? What about the oddballs -- 12b? 1b?
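To make the register-pair point concrete, here is roughly what a 16-bit add looks like when synthesized from 8-bit pieces -- an illustration in C of what the hardware/compiler does under the hood (a Z80's ADD HL,DE hides the same trick):

#include <stdint.h>

/* Sketch: a 16-bit add built from byte-wide adds plus a carry, i.e.
 * what "16b operations on an 8b processor" amounts to. */
static uint16_t add16_from_bytes(uint8_t ah, uint8_t al,
                                 uint8_t bh, uint8_t bl)
{
    uint8_t lo    = (uint8_t)(al + bl);
    uint8_t carry = (uint8_t)(lo < al);         /* carry out of the low byte */
    uint8_t hi    = (uint8_t)(ah + bh + carry);
    return (uint16_t)((hi << 8) | lo);
}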
> I think we will increasingly see parts which have different sizes on one > area but not the other. > > For example, for doing some kinds of logical operations (eg crypto), having > 64-bit registers and ALU makes sense, but you might only need kilobytes of > memory so only have <32 address bits.
That depends on the algorithm chosen and the hardware support available.
> For something else, like a microcontroller that's hung off the side of a > bigger system (eg the MCU on a PCIe card) you might want the ability to > handle 64 bit addresses but don't need to pay the price for 64-bit > registers. > > Or you might operate with 16 or 32 bit wide external RAM chip, but your > cache could extend that to a wider word width. > > There are many permutations, and I think people will pay the cost where it > benefits them and not where it doesn't.
But you don't buy MCUs with a-la-carte pricing. How much does an extra timer cost me? What if I want it to also serve as a *counter*? What cost for 100K of internal ROM? 200K? [It would be an interesting exercise to try to do a linear analysis of product prices with an idea of trying to tease out the "costs" (to the developer) for each feature in EXISTING products!] Instead, you see a *price* that is reflective of how widely used the device happens to be, today. You are reliant on the preferences of others to determine which is the most cost effective product -- for *you*. E.g., most of my devices have no "display" -- yet, the MCU I've chosen has hardware support for same. It would obviously cost me more to select a device WITHOUT that added capability -- because most purchasers *want* a display (and *they* drive the production economies). I could, potentially, use a 2A03 for some applications. But, the "TCO" of such an approach would exceed that of a 32b (or larger) processor! [What a crazy world!]
> This is not a new phenomenon, of course. But for a time all these numbers > were in the range between 16 and 32 bits, which made 32 simplest all round. > Just like we previously had various 8/16 hybrids (eg 8 bit datapath, 16 bit > address) I think we're going to see more 32/64 hybrids. > > Theo >
On 6/8/2021 12:38 PM, James Brakefield wrote:

> |> I contend that a good many "32b" implementations are really glorified > |> 8/16b applications that exhausted their memory space. > > The only thing that will take more than 4GB is video or a day's worth of photos.
That's not true. For example, I rely on a "PC" in my current design to support the RDBMS. Otherwise, I would have to design a "special node" (I have a distributed system) that had the resources necessary to process multiple concurrent queries in a timely fashion; I can put 100GB of RAM in a PC (whereas my current nodes only have 256MB). The alternative is to rely on secondary (disk) storage -- which is even worse!

And "video" is incredibly nondescript. It conjures ideas of STBs. Instead, I see a wider range of applications in terms of *vision*. E.g., let your doorbell camera "notice motion", recognize that motion as indicative of someone/thing approaching it (e.g., a visitor), recognize the face/features of the visitor and alert you to its presence (if desired). No need to involve a cloud service to do this.

[My "doorbell" is a camera/microphone/speaker. *If* I want to know that you are present, *it* will tell me. Or, if told to do so, will grant you access to the house (even in my absence). For "undesirables", I'm mounting a coin mechanism adjacent to the entryway (our front door is protected by a gated porch area): "Deposit 25c to ring bell. If we want to talk to you, your deposit will be refunded. If *not*, consider that the *cost* of pestering us!"]

There are surveillance cameras discreetly placed around the exterior of the house (don't want the place to look like a frigging *bank*!). One of them has a clear view of the mailbox (our mail is delivered via letter carriers riding in mail trucks). Same front door camera hardware. But, now: detect motion; detect motion STOPPING proximate to the mailbox (for a few seconds or more); detect motion resuming; signal "mail available". Again, no need to involve a cloud service to accomplish this. And, when not watching for mail delivery, it's performing "general" surveillance -- mail detection is a "free bonus"!

Imagine designing a vision-based inspection system where you "train" the CAMERA -- instead of some box that the camera connects to. And, the CAMERA signals accept/reject directly.

[I use a boatload of cameras, here; they are cheap sensors -- the "cost" lies in the signal processing!]
> So there is likely to be some embedded aps that need a > 32-bit address space. > Cost, size or storage capacity are no longer limiting factors.
No, cost, size and storage are ALWAYS limiting factors!

E.g., each of my nodes derives power from the wired network connection. That puts a practical limit of ~12W on what a node can dissipate. That has to support the processing core plus any local I/Os! Note that dissipated power == heat. So, one also has to be conscious of how that heat will affect the devices' environs. (Yes, there are schemes to increase this to ~100W, but now the cost of providing power -- and BACKUP power -- to a remote device starts to be a sizeable portion of the product's cost and complexity.)

My devices are intended to be "invisible" to the user -- so, they have to hide *inside* something (most commonly, the walls or ceiling -- in standard Jboxes for accessibility and Code compliance). So, that limits their size/volume (mine are about the volume of a standard duplex receptacle -- 3 cu in -- so fit in even the smallest of 1G boxes... even pancake boxes!)

They have to be inexpensive so I can justify using LOTS of them (I will have 240 deployed, here; my industrial beta site will have over 1000; the commercial beta site almost a similar number). Not only is the cost of initial acquisition of concern, but also the *perceived* cost of maintaining the hardware in a functional state (the customer doesn't want to have $10K of spares on hand for rapid incident response, plus staff able to diagnose and repair/replace "on demand").

In my case, I sidestep the PERSISTENT storage issue by relegating that to the RDBMS. In *that* domain, I can freely add spinning rust or an SSD without complicating the design of the rest of the nodes. So, "storage" becomes:
- how much do I need for a secure bootstrap
- how much do I need to contain a downloaded (from the RDBMS!) binary
- how much do I need to keep "local runtime resources"
- how much can I exploit surplus capacity *elsewhere* in the system to address transient needs

Imagine what it would be like having to replace "worn" SD cards at some frequency in hundreds of devices scattered around hundreds of "invisible" places! Almost as bad as replacing *batteries* in those devices! [Have you ever had an SD card suddenly write-protect itself?]
> Am trying to puzzle out what a 64-bit embedded processor should look like.
"Should"? That depends on what you expect it to do for you. The nonrecurring cost of development will become an ever-increasing portion of the device's "cost". If you sell 10K units but spend 500K on development (over its lifetime), you've justification for spending a few more dollars on recurring costs *if* you can realize a reduction in development/maintenance costs (because the development is easier, bugs are fewer/easier to find, etc.) Developers (and silicon vendors, as Good Business Practice) will look at their code and see what's "hard" to do, efficiently. Then, consider mechanisms that could make that easier or more effective. I see the addition of hardware features that enhance the robustness of the software development *process*. E.g., allowing for compartmentalizing applications and subsystems more effectively and *efficiently*. [I put individual objects into their own address space containers to ensure Object A can't be mangled by Client B (or Object C). As a result, talking to an object is expensive because I have to hop back and forth across that protection boundary. It's even worse when the targeted object is located on some other physical node (as now I have the transport cost to contend with).] Similarly, making communications more robust. We already see that with crypto accelerators. The idea of device "islands" is obsolescent. Increasingly, devices will interact with other devices to solve problems. More processing will move to the edge simply because of scaling issues (I can add more CPUs far more effectively than I can increase the performance of a "centralized" CPU; add another sense/control point? let *it* bring some processing abilities along with it!). And, securing the product from tampering/counterfeiting; it seems like most approaches, to date, have some hidden weakness. It's hard to believe hardware can't ameliorate that. The fact that "obscurity" is still relied upon by silicon vendors suggests an acknowledgement of their weaknesses. Beyond that? Likely more DSP-related support in the "native" instruction set (so you can blend operations between conventional computing needs and signal processing related issues). And, graphics acceleration as many applications implement user interfaces in the appliance. There may be some other optimizations that help with hashing or managing large "datasets" (without them being considered formal datasets). Power management (and measurement) will become increasingly important (I spend almost as much on the "power supply" as I do on the compute engine). Developers will want to be able to easily ascertain what they are consuming as well as why -- so they can (dynamically) alter their strategies. In addition to varying CPU clock frequency, there may be mechanisms to automatically (!) power down sections of the die based on observed instruction sequences (instead of me having to explicitly do so). [E.g., I shed load when I'm running off backup power. This involves powering down nodes as well as the "fields" on selective nodes. How do I decide *which* load to shed to gain the greatest benefit?] Memory management (in the conventional sense) will likely see more innovation. Instead of just "settling" for a couple of page sizes, we might see "adjustable" page sizes. Or, the ability to specify some PORTION of a *particular* page as being "valid" -- instead of treating the entire page as such. Scheduling algorithms will hopefully get additional hardware support. E.g., everything is deadline driven in my design ("real-time"). 
So, schedulers are concerned with evaluating the deadlines of "ready" tasks -- which can vary over time, and may need further qualification based on other criteria (e.g., least-slack-time scheduling).

Everything in my system is an *opaque* object on which a set of POSSIBLE methods can be invoked. But, each *Client* of that object (an Actor may be multiple Clients if it possesses multiple different Handles to the Object) is constrained as to which methods can be invoked via a particular Handle.

So, I can (e.g.) create an Authenticator object that has methods like "set_passphrase", "test_passphrase" and "invalidate_passphrase". Yet, no "disclose_passphrase" method (for obvious reasons). I can create an Interface to one privileged Client that allows it to *set* a new passphrase. And, all other Interfaces (to that Client as well as others!) may be restricted to only *testing* the passphrase ("Is it 'foobar'?"). And, I can limit the number of attempts a Client can make to invoke a particular method over a particular Interface, so the OS does the enforcement instead of relying on the Server to do so.

[What's to stop a Client from hammering on the Server (Authenticator Object) repeatedly -- invoking test_passphrase with full knowledge that it doesn't know the correct passphrase: "Is it 'foobar'?" "Is it 'foobar'?" "Is it 'foobar'?" "Is it 'foobar'?" "Is it 'foobar'?" The Client has been enabled to do this; that doesn't mean it can't or won't abuse it! Note that unlimited access means the Server has to respond to each of those method invocations. By contrast, putting a limit on them means the OS can block the invocation from ever reaching the Object (instead of needlessly tying up the Object's resources). A capabilities-based system that relies on encrypted tokens means the Server has to decrypt a token in order to determine that it is invalid; the Server's resources are consumed instead of the Client's.]

It takes effort (in the kernel) to verify that a Client *can* access a particular Object (i.e., has a Handle to it) AND that the Client can invoke THAT particular Method on that Object via this Handle (bound to a particular Object *Interface*), as well as verifying the format of the data and converting it to a format suitable for the targeted Object (which may use a different representational structure) for a particular Version of the Interface... I can either skimp on performing some of these checks (and rely on other mechanisms to ensure the security and reliability of the codebase -- in the presence of unvetted Actors) or hope that some hardware mechanism in the processor makes these a bit easier.
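A very rough sketch of that Handle/Interface gating, purely illustrative -- the struct layout, method names and the kernel_gate() helper are invented here, not the actual system:

#include <stdbool.h>
#include <stdint.h>

enum { M_SET_PASSPHRASE, M_TEST_PASSPHRASE, M_INVALIDATE_PASSPHRASE, M_COUNT };

struct handle {
    void    *object;            /* the opaque Object this Handle is bound to */
    uint32_t allowed;           /* bitmask: methods this Interface permits   */
    uint16_t budget[M_COUNT];   /* remaining invocations allowed per method  */
};

/* Kernel-side check made *before* the invocation ever reaches the Object:
 * wrong method, wrong Interface or exhausted quota is rejected at the
 * Client's expense, not the Server's. */
static bool kernel_gate(struct handle *h, unsigned method)
{
    if (method >= M_COUNT)              return false;
    if (!(h->allowed & (1u << method))) return false;  /* not in this Interface */
    if (h->budget[method] == 0)         return false;  /* attempts used up      */
    h->budget[method]--;
    return true;
}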
> At the low end, yeah, a simple RISC processor. And support for complex arithmetic > using 32-bit floats? And support for pixel alpha blending using quad 16-bit numbers? > 32-bit pointers into the software?
I doubt complex arithmetic will have much play. There might be support for *building* larger data types (e.g., I use BigRationals which are incredibly inefficient). But, the bigger bang will be for operators that allow tedious/iterative solutions to be implemented in constant time. This, for example, is why a hardware multiply (or other FPU capabilities) is such a win -- consider the amount of code that is replaced by a single op-code! Ditto things like "find first set bit", etc. Why stick with 32b floats when you can likely implement doubles with a bit more microcode (surely faster than trying to do wider operations built from narrower ones)? There's an entirely different mindset when you start thinking in terms of "bigger processors". I.e., the folks who see 32b processors as just *wider* 8/16b processors have typically not made this adjustment. It's like trying to "sample the carry" in a HLL (common in ASM) instead of concentrating on what you REALLY want to do and letting the language make it easier for you to express that. Expect to see people making leaps forward in terms of what they expect from the solutions they put forth. Anything that you could do with a PC, before, can now be done *in* a handheld flashlight!
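As a concrete example of the "find first set bit" point: without hardware help you iterate; with it, the whole loop collapses into a single instruction (GCC/Clang builtin shown, purely as an illustration):

#include <stdint.h>

/* The portable way: up to 32 iterations. */
static int ffs_loop(uint32_t x)
{
    for (int i = 0; i < 32; i++)
        if (x & (1u << i))
            return i + 1;            /* 1-based bit index */
    return 0;                        /* no bits set */
}

/* With hardware support the compiler emits one CTZ/CLZ-style op. */
static int ffs_fast(uint32_t x)
{
    return x ? __builtin_ctz(x) + 1 : 0;
}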
On 6/8/2021 4:04 AM, David Brown wrote:
> On 08/06/2021 09:39, Don Y wrote: >> On 6/7/2021 10:59 PM, David Brown wrote: >>> 8-bit microcontrollers are still far more common than 32-bit devices in >>> the embedded world (and 4-bit devices are not gone yet). At the other >>> end, 64-bit devices have been used for a decade or two in some kinds of >>> embedded systems. >> >> I contend that a good many "32b" implementations are really glorified >> 8/16b applications that exhausted their memory space. > > Sure. Previously you might have used 32 kB flash on an 8-bit device, > now you can use 64 kB flash on a 32-bit device. The point is, you are > /not/ going to find yourself hitting GB limits any time soon. The step
I don't see the "problem" with 32b devices as one of address space limits (except devices utilizing VMM with insanely large page sizes). As I said, in my application, task address spaces are really just a handful of pages. I *do* see (flat) address spaces that find themselves filling up with stack-and-heap-per-task, big chunks set aside for "onboard" I/Os, *partial* address decoding for offboard I/Os, etc. (i.e., you're not likely going to fully decode a single address to access a set of DIP switches as the decode logic is disproportionately high relative to the functionality it adds) How often do you see a high-order address line used for kernel/user? (gee, now your "user" space has been halved)
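For readers who haven't met the trick: the kernel/user split on a high-order address line is literally a one-bit test, and it costs you half the logical space (the 0x80000000 split below is just a common convention, not any particular part's memory map):

#include <stdbool.h>
#include <stdint.h>

#define KERNEL_BASE 0x80000000u      /* top bit set => kernel space (illustrative) */

static inline bool is_user_address(uint32_t addr)
{
    return addr < KERNEL_BASE;       /* user code gets only the lower 2 GB */
}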
> from 8-bit or 16-bit to 32-bit is useful to get a bit more out of the > system - the step from 32-bit to 64-bit is totally pointless for 99.99% > of embedded systems. (Even for most embedded Linux systems, you usually > only have a 64-bit cpu because you want bigger and faster, not because > of memory limitations. It is only when you have a big gui with fast > graphics that 32-bit address space becomes a limitation.)
You're assuming there has to be some "capacity" value to the 64b move. You might discover that the ultralow power devices (for phones!) are being offered in the process geometries targeted for the 64b devices. Or, that some integrated peripheral "makes sense" for phones (but not MCUs targeting motor control applications). Or, that there are additional power management strategies supported in the hardware. In my mind, the distinction brought about by "32b" was more advanced memory protection/management -- even if not used in a particular application. You simply didn't see these sorts of mechanisms in 8/16b offerings. Likewise, floating point accelerators. Working in smaller processors meant you had to spend extra effort to bullet-proof your code, economize on math operators, etc. So, if you wanted the advantages of those (hardware) mechanisms, you "upgraded" your design to 32b -- even if it didn't need gobs of address space or generic MIPS. It just wasn't economical to bolt on an AM9511 or practical to build a homebrew MMU.
> A 32-bit microcontroller is simply much easier to work with than an > 8-bit or 16-bit with "extended" or banked memory to get beyond 64 K > address space limits.
There have been some 8b processors that could seamlessly (in an HLL) handle extended address spaces. The Z180s were delightfully easy to use that way. You just had to keep in mind that a "call" to a different bank was more expensive than a "local" call (though there were no syntactic differences; the linkage editor and runtime package made this invisible to the developer).

We were selling products with 128K of DRAM on Z80s back in 1981 -- because it was easier to design THAT hardware than to step up to a 68K, for example (as well as leveraging our existing codebase).

The "video game era" was built on hybridized 8b systems -- even though you could buy 32b hardware at the time. You would be surprised at the ingenuity of many of those systems in offloading costly (time consuming) operations from the processor to make the device appear more powerful than it actually was.
>>> We'll see 64-bit take a greater proportion of the embedded systems that >>> demand high throughput or processing power (network devices, hard cores >>> in expensive FPGAs, etc.) where the extra cost in dollars, power, >>> complexity, board design are not a problem. They will probably become >>> more common in embedded Linux systems as the core itself is not usually >>> the biggest part of the cost. And such systems are definitely on the >>> increase. >>> >>> But for microcontrollers - which dominate embedded systems - there has >>> been a lot to gain by going from 8-bit and 16-bit to 32-bit for little >> >> I disagree. The "cost" (barrier) that I see clients facing is the >> added complexity of a 32b platform and how it often implies (or even >> *requires*) a more formal OS underpinning the application. > > Yes, that is definitely a cost in some cases - 32-bit microcontrollers > are usually noticeably more complicated than 8-bit ones. How > significant the cost is depends on the balances of the project between > development costs and production costs, and how beneficial the extra > functionality can be (like moving from bare metal to RTOS, or supporting > networking).
I see most 32b designs operating without the benefits that a VMM system can apply (even if you discount demand paging). They just want to have a big address space and not have to dick with "segment registers", etc. They plow through the learning effort required to configure the device to move the "extra capabilities" out of the way. Then, just treat it like a bigger 8/16 processor. You can "bolt on" a simple network stack even with a rudimentary RTOS/MTOS. Likewise, a web server. Now, you remove the need for graphics and other UI activities hosted *in* the device. And, you likely don't need to support multiple concurrent clients. If you want to provide those capabilities, do that *outside* the device (let it be someone else's problem). And, you gain "remote access" for free. Few such devices *need* (or even WANT!) ARP caches, inetd, high performance stack, file systems, etc. Given the obvious (coming) push for enhanced security in devices, anything running on your box that you don't need (or UNDERSTAND!) is likely going to be pruned off as a way to reduce the attack surface. "Why is this port open? What is this process doing? How robust is the XXX subsystem implementation to hostile actors in an *unsupervised* setting?"
>>> cost. There is almost nothing to gain from a move to 64-bit, but the >>> cost would be a good deal higher. >> >> Why is the cost "a good deal higher"? Code/data footprints don't >> uniformly "double" in size. The CPU doesn't slow down to handle >> bigger data. > > Some parts of code and data /do/ double in size - but not uniformly, of > course. But your chip is bigger, faster, requires more power, has wider > buses, needs more advanced memories, has more balls on the package, > requires finer pitched pcb layouts, etc.
And has been targeted to a market that is EXTREMELY power sensitive (phones!). It is increasingly common for manufacturing technologies to be moving away from "casual development". The days of owning your own wave and doing in-house manufacturing at a small startup are gone. If you want to limit yourself to the kinds of products that you CAN (easily) assemble, you will find yourself operating with a much poorer selection of components available. I could fab a PCB in-house and build small runs of prototypes using the wave and shake-and-bake facilities that we had on hand. Harder to do so, nowadays. This has always been the case. When thru-hole met SMT, folks had to either retool to support SMT, or limit themselves to components that were available in thru-hole packages. As the trend has always been for MORE devices to move to newer packaging technologies, anyone who spent any time thinking about it could read the writing on the wall! (I bought my Leister in 1988? Now, I prefer begging favors from colleagues to get my prototypes assembled!) I suspect this is why we now see designs built on COTS "modules" increasingly. Just like designs using wall warts (so they don't have to do the testing on their own, internally designed supplies). It's one of the reasons FOSH is hampered (unlike FOSS, you can't roll your own copy of a hardware design!)
> In theory, you /could/ make a microcontroller in a 64-pin LQFP and > replace the 72 MHz Cortex-M4 with a 64-bit ARM core at the same clock > speed. The die would only cost two or three times more, and take > perhaps less than 10 times the power for the core. But it would be so > utterly pointless that no manufacturer would make such a device.
This is specious reasoning: "You could take the die out of a 68K and replace it with a 64 bit ARM." Would THAT core cost two or three times more (do you recall how BIG 68K die were?) and consume 10 times the power? (it would consume considerably LESS). The market will drive the cost (power, size, $$$, etc.) of 64b cores down as they will find increasing use in devices that are size and power constrained. There's far more incentive to make a cheap, low power 64b ARM than there is to make a cheap, low power i686 (or 68K) -- you don't see x86 devices in phones (laptops have bigger power budgets so less pressure on efficiency). There's no incentive to making thru-hole versions of any "serious" processor, today. Just like you can't find any fabs for DTL devices. Or 10 & 12" vinyl. (yeah, you can buy vinyl, today -- at a premium. And, I suspect you can find someone to package an ARM on a DIP carrier. But, each of those are niche markets, not where the "money lies")
> So a move to 64-bit in practice means moving from a small, cheap, > self-contained microcontroller to an embedded PC. Lots of new > possibilities, lots of new costs of all kinds.
How do you come to that conclusion? I have a 32b MCU on a board. And some FLASH and DRAM. How is that going to change when I move to a 64b processor? The 64b devices are also SoCs so it's not like you suddenly have to add address decoding logic, a clock generator, interrupt controller, etc. Will phones suddenly become FATTER to accommodate the extra hardware needed? Will they all need bolt on battery boosters?
> Oh, and the cpu /could/ be slower for some tasks - bigger cpus that are > optimised for throughput often have poorer latency and more jitter for > interrupts and other time-critical features.
You're cherry picking. They can also be FASTER for other tasks and likely will be optimized to justify/exploit those added abilities; a vendor isn't going to offer a product that is LESS desirable than his existing products. An IPv6 stack on a 64b processor is a bit easier to implement than on a 32b one. (Remember, ARM is in a LOT of fabs! That speaks to how ubiquitous it is!)
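One small illustration of the IPv6 point: a 128-bit address compare is two register-wide operations on a 64-bit core versus four on a 32-bit one (sketch only; the memcpy sidesteps alignment/aliasing issues and typically compiles away):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static bool ipv6_addr_equal(const uint8_t a[16], const uint8_t b[16])
{
    uint64_t a0, a1, b0, b1;
    memcpy(&a0, a, 8);  memcpy(&a1, a + 8, 8);
    memcpy(&b0, b, 8);  memcpy(&b1, b + 8, 8);
    return ((a0 ^ b0) | (a1 ^ b1)) == 0;   /* two wide compares, no loop */
}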
>>> So it is not going to happen - at >>> least not more than a very small and very gradual change. >> >> We got 32b processors NOT because the embedded world cried out for >> them but, rather, because of the influence of the 32b desktop world. >> We've had 32b processors since the early 80's. But, we've only had >> PCs since about the same timeframe! One assumes ubiquity in the >> desktop world would need to happen before any real spillover to embedded. >> (When the "desktop" was an '11 sitting in a back room, it wasn't seen >> as ubiquitous.) > > I don't assume there is any direct connection between the desktop world > and the embedded world - the needs are usually very different. There is > a small overlap in the area of embedded devices with good networking and > a gui, where similarity to the desktop world is useful.
The desktop world inspires the embedded world. You see what CAN be done for "reasonable money". In the 70's, we put i4004's into products because we knew the processing that was required was "affordable" (at several kilobucks) -- because we had our own '11 on site. We leveraged the in-house '11 to compute "initialization constants" for the needs of specific users (operating the i4004-based products). We didn't hesitate to migrate to i8080/85 when they became available -- because the price point was largely unchanged (from where it had been with the i4004) AND we could skip the involvement of the '11 in computing those initialization constants! I watch the prices of the original 32b ARM I chose fall and see that as an opportunity -- to UPGRADE the capabilities (and future-safeness of the design). If I'd assumed $X was a tolerable price, before, then it likely still is!
> We have had 32-bit microcontrollers for decades. I used a 16-bit > Windows system when working with my first 32-bit microcontroller. But > at that time, 32-bit microcontrollers cost a lot more and required more > from the board (external memories, more power, etc.) than 8-bit or > 16-bit devices. That has gradually changed with an almost total > disregard for what has happened in the desktop world.
I disagree. I recall having to put lots of "peripherals" into an 8/16b system, external address decoding logic, clock generators, DRAM controllers, etc. And, the cost of entry was considerably higher. Development systems used to cost tens of kilodollars (Intellec MDS, Zilog ZRDS, Moto EXORmacs, etc.) I shared a development system with several other developers in the 70's -- because the idea of giving each of us our own was anathema, at the time. For 35+ years, you could put one on YOUR desk for a few kilobucks. Now, it's considerably less than that. You'd have to be blind to NOT think that the components that are "embedded" in products haven't -- and won't continue -- to see similar reductions in price and increases in performance. Do you think the folks making the components didn't anticipate the potential demand for smaller/faster/cheaper chips? We've had TCP/IP for decades. Why is it "suddenly" more ubiquitous in product offerings? People *see* what they can do with a technology in one application domain (e.g., desktop) and extrapolate that to other, similar application domains (embedded). I did my first full custom 30+ years ago. Now, I can buy an off-the-shelf component and "program" it to get similar functionality (without involving a service bureau). Ideas that previously were "gee, if only..." are now commonplace.
> Yes, the embedded world /did/ cry out for 32-bit microcontrollers for an > increasing proportion of tasks. We cried many tears when then > microcontroller manufacturers offered to give more flash space to their > 8-bit devices by having different memory models, banking, far jumps, and > all the other shit that goes with not having a big enough address space. > We cried out when we wanted to have Ethernet and the microcontroller > only had a few KB of ram. I have used maybe 6 or 8 different 32-bit > microcontroller processor architectures, and I used them because I > needed them for the task. It's only in the past 5+ years that I have > been using 32-bit microcontrollers for tasks that could be done fine > with 8-bit devices, but the 32-bit devices are smaller, cheaper and > easier to work with than the corresponding 8-bit parts.
But that's because your needs evolve and the tools you choose to use have, as well. I wanted to build a little line frequency clock to see how well it could discipline my NTPd. I've got all these PCs, single board PCs, etc. lying around. It was *easier* to hack together a small 8b processor to do the job -- less hardware to understand, no OS to get in the way, really simple to put a number on the interrupt latency that I could expect, no uncertainties about the hardware that's on the PC, etc. OTOH, I have a network stack that I wrote for the Z180 decades ago. Despite being written in a HLL, it is a bear to deploy and maintain owing to the tools and resources available in that platform. My 32b stack was a piece of cake to write, by comparison!
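That whole line-frequency clock really is just an ISR and a counter. A minimal sketch along those lines, assuming an ATmega-class part with the zero-cross signal on INT0 and a 1 Hz output pulse on PB0 for the NTP host to timestamp (the pin choices and the 60 Hz constant are assumptions for illustration, not a description of the actual gadget):

#include <avr/io.h>
#include <avr/interrupt.h>

#define CYCLES_PER_SECOND 60u        /* 60 Hz mains assumed; use 50u elsewhere */

static volatile uint8_t cycles;

ISR(INT0_vect)                       /* one interrupt per mains cycle */
{
    if (++cycles >= CYCLES_PER_SECOND) {
        cycles = 0;
        PORTB ^= _BV(PB0);           /* edge for the host to timestamp */
    }
}

int main(void)
{
    DDRB  |= _BV(PB0);                       /* pulse output        */
    EICRA |= _BV(ISC01) | _BV(ISC00);        /* INT0 on rising edge */
    EIMSK |= _BV(INT0);
    sei();
    for (;;) { }                             /* nothing else to do  */
}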
>> In the future, we'll see the 64b *phone* world drive the evolution >> of embedded designs, similarly. (do you really need 32b/64b to >> make a phone? how much code is actually executing at any given >> time and in how many different containers?) > > We will see that on devices that are, roughly speaking, tablets - > embedded systems with a good gui, a touchscreen, networking. And that's > fine. But these are a tiny proportion of the embedded devices made.
Again, I disagree. You've already admitted to using 32b processors where 8b could suffice. What makes you think you won't be using 64b processors when 32b could suffice?

It's just as hard for me to prototype a 64b SoC as it is a 32b SoC. The boards are essentially the same size. "System" power consumption is almost identical. Cost is the sole differentiating factor, today. History tells us it will be less so, tomorrow. And, the innovations that will likely come in that offering will likely exceed the capabilities (or perceived market needs) of smaller processors. To say nothing of the *imagined* uses that future developers will envision!

I can make a camera that "reports to google/amazon" to do motion detection, remote access, etc. Or, for virtually the same (customer) dollars, I can provide that functionality locally. Would a customer want to add an "unnecessary" dependency to a solution? "Tired of being dependent on Big Brother for your home security needs? ..." Imagine a 64b SoC with a cellular radio: "I'll *call* you when someone comes to the door..." (or SMS)

I have cameras INSIDE my garage that assist with my parking and tell me if I've forgotten to close the garage door. Should I have google/amazon perform those value-added tasks for me? Will they tell me if I've left something in the car's path before I run over it? Will they turn on the light to make it easier for me to see? Should I, instead, tether all of those cameras to some "big box" that does all of that signal processing? What happens to those resources when the garage is "empty"?

The "electric eye" (interrupter) that guards against closing the garage door on a toddler/pet/item in its path does nothing to protect me if I leave some portion of the vehicle in the path of the door (but ABOVE the detection range of the interrupter). Locating a *camera* on the side of the doorway lets me detect if ANYTHING is in the path of the door, regardless of how high above the old interrupter's position it may be located. How *many* camera interfaces should the SoC *directly* support?

The number (and type) of applications that can be addressed with ADDITIONAL *local* smarts/resources is almost boundless. And, folks don't have to wait for a cloud supplier (off-site processing) to decide to offer them. "Build it and they will come."

[Does your thermostat REALLY need all of that horsepower -- two processors! -- AND google's server in order to control the HVAC in your home? My god, how did that simple bimetallic strip ever do it?!]

If you move into the commercial/industrial domains, the opportunities are even more diverse! (E.g., build a camera that does component inspection *in* the camera and interfaces to a go/no-go gate or labeller.)

Note that none of these applications need a display, touch panel, etc. What they likely need is low power, small size, connectivity, MIPS and memory. The same sorts of things that are common in phones.
>>> The OP sounds more like a salesman than someone who actually works with >>> embedded development in reality. >> >> Possibly. Or, just someone that wanted to stir up discussion... > > Could be. And there's no harm in that!
On that, we agree. Time for ice cream (easiest -- and most enjoyable -- way to lose weight)!
James Brakefield <jim.brakefield@ieee.org> writes:
> Am trying to puzzle out what a 64-bit embedded processor should look like.
Buy yourself a Raspberry Pi 4 and set it up to run your fish tank via a remote web browser. There's your 64 bit embedded system.
On 6/8/2021 3:01 PM, Dimiter_Popoff wrote:

>> Am trying to puzzle out what a 64-bit embedded processor should look like. >> At the low end, yeah, a simple RISC processor. And support for complex >> arithmetic >> using 32-bit floats? And support for pixel alpha blending using quad 16-bit >> numbers? >> 32-bit pointers into the software? > > The real value in 64 bit integer registers and 64 bit address space is > just that, having an orthogonal "endless" space (well I remember some > 30 years ago 32 bits seemed sort of "endless" to me...). > > Not needing to assign overlapping logical addresses to anything > can make a big difference to how the OS is done.
That depends on what you expect from the OS. If you are comfortable with the possibility of bugs propagating between different subsystems, then you can live with a logical address space that exactly coincides with a physical address space. But, consider how life was before Windows used compartmentalized applications (and OS). How easily it is for one "application" (or subsystem) to cause a reboot -- unceremoniously. The general direction (in software development, and, by association, hardware) seems to be to move away from unrestrained access to the underlying hardware in an attempt to limit the amount of damage that a "misbehaving" application can cause. You see this in languages designed to eliminate dereferencing pointers, pointer arithmetic, etc. Languages that claim to ensure your code can't misbehave because it can only do exactly what the language allows (no more injecting ASM into your HLL code). I think that because you are the sole developer in your application, you see a distorted vision of what the rest of the development world encounters. Imagine handing your codebase to a third party. And, *then* having to come back to it and fix the things that "got broken". Or, in my case, allowing a developer to install software that I have to "tolerate" (for some definition of "tolerate") without impacting the software that I've already got running. (i.e., its ok to kill off his application if it is broken; but he can't cause *my* portion of the system to misbehave!)
> 32 bit FPU seems useless to me, 64 bit is OK. Although 32 FP > *numbers* can be quite useful for storing/passing data.
32-bit numbers have appeal if your registers are 32b; they "fit nicely". Ditto 64b values in 64b registers.
On 6/8/2021 1:39 PM, Dimiter_Popoff wrote:

> Not long ago in a chat with a guy who knew some of ARM 64 bit I gathered > there is some real mess with their out of order execution, one needs to > do... hmmmm.. "sync", whatever they call it, all the time and there is > a huge performance cost because of that. Anybody heard anything about > it? (I only know what I was told).
Many processors support instruction reordering (and many compilers will reorder the code they generate). In each case, the reordering is supposed to preserve semantics. If the code "just runs" (and is never interrupted nor synchronized with something else), the result should be the same. If you want to be able to arbitrarily interrupt an instruction sequence, then you need to take special measures. This is why we have barriers, the ability to flush caches, etc. For "generic" code, the developer isn't involved with any of this. Inside the kernel (or device drivers), its often a different story...
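In C11 terms, the "special measures" are acquire/release ordering; the compiler and CPU then insert whatever barrier the architecture needs (a DMB on ARM, for instance). A minimal producer/consumer sketch:

#include <stdatomic.h>
#include <stdbool.h>

static int payload;                     /* plain data       */
static atomic_bool ready;               /* publication flag */

void producer(int value)
{
    payload = value;                                    /* ordinary store */
    atomic_store_explicit(&ready, true,
                          memory_order_release);        /* publish        */
}

bool consumer(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;     /* guaranteed to observe the producer's store */
        return true;
    }
    return false;
}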
On Tue, 8 Jun 2021 22:11:18 +0200, David Brown
<david.brown@hesbynett.no> wrote:


>Pretty much all processors except x86 and brain-dead old-fashioned 8-bit >CISC devices are RISC...
It certainly is correct to say of the x86 that its legacy, programmer-visible instruction set is CISC ... but it is no longer correct to say that the chip design is CISC. Since (at least) the Pentium 4, x86 chips really are a CISC decoder bolted onto the front of what essentially is a load/store RISC.

"Complex" x86 instructions (in RAM and/or the $I cache) are dynamically translated into equivalent short sequences[*] of RISC-like, wide-format instructions, which are what actually gets executed. Those sequences also are stored into a special trace cache in case they will be used again soon - e.g., in a loop - so they (hopefully) will not have to be translated again.

[*] Actually, a great many x86 instructions map 1:1 to internal RISC instructions - only a small percentage of complex x86 instructions require "emulation" via a sequence of RISC instructions.
>... Not all [RISC] are simple.
Correct. Every successful RISC CPU has supported a suite of complex instructions. Of course, YMMV. George
